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INTRODUCTION 


When  we  first  agreed  in  mid- 1989  that  a  need  existed  for  an  applied  (practical)  statistics  workshop, 
it  hardly  occurred  to  us  that  as  many  as  63  other  people  in  Alberta  would  concur!  Yet,  that  was  the 
final  attendance  at  the  First  All-Alberta  Apphed  Statistics  and  Biometrics  Workshop,  October  18- 
19,  1990,  at  the  Alberta  Research  Council,  Edmonton  (Millwoods  facihty).  We  believe  these 
Proceedings  will  reflect  the  fine  participation  and  diversity  of  interests  exhibited  by  the  speakers 
and  those  who  attended. 

We  are  fortunate  to  have  received  the  support  of  the  Alberta  Agricultural  Research  Institute 
(AARI)  which  granted  us  fiinds  through  the  Research  Coordination  Grants  program;  our  sincerest 
appreciation  to  the  AARI  Board  for  deeming  this  project  worthy  of  support  in  1990.  We  are  most 
pleased  to  say  that  the  Board  has  approved  our  application  and  agreed  to  renew  our  funding  in  1991. 
Plans  are  well  underway  for  the  next  workshop  to  be  held  in  Edmonton,  October  21-22,  1991. 

In  addition  to  the  list  of  participants  in  Appendix  B  and  the  speakers  featured  in  these  Proceedings, 
we  are  indebted  to  many  others  who  got  behind  this  project  and  proceeded  to  help  make  it  happen. 

Thanks  to  Serge  Dupuis,  Bob  Hardin  and  Keith  Toogood  for  serving  on  the  ad  hoc  Program 
Planning  Committee. 

Arhlene  Hrynyk,  her  staff,  and  the  late  Don  Klick,  in  the  Animal  Sciences  Division,  Alberta 
Environmental  Centre  (AEC),  were  most  helpfiil  during  the  early  stages  by  preparing  mailing 
lists,  survey  forms  and  correspondence.  Special  appreciation  to  Phil  Henry  who  lent  his  ideas, 
setup  the  computer  aided  registration  on  the  Macintosh  and  created  our  "Z  -Alberta"  logo,  among 
other  graphics.  Sincere  appreciation  as  well  to  Janet  Smalley  and  the  support  staff  of  the  Beef 
Cattle  and  Sheep  Branch  of  Alberta  Agriculture  for  their  kind  assistance  during  registration. 

Many  thanks  go  to  Marilyn  Florence  who  volunteered  her  assistance  during  preparation  of  the 
program  packets  and  these  Proceedings. 

Robert  Heimann,  Alberta  Research  Coimcil,  suggested  our  having  the  workshop  at  the  Edmonton 
(Millwoods)  facility  and  kindly  Eigreed  to  act  as  official  local  host. 

Finally,  we  are  particularly  pleased  to  acknowledge  the  support  we  have  each  received  from  our 
respective  directors  and  branch  heads,  and  the  Departments  of  Environment  and  Agriculture. 

Notwithstanding  the  fine  suggestions  for  topics  from  the  Program  Planning  Committee,  the 
workshop  had  no  theme;  our  attempt  to  have  a  good  cross  section  of  applied  statistical  topics,  we 
believe,  is  demonstrated  by  the  contents  of  these  Proceedings. 


L.  Zack  Florence 

Alberta  Environmental  Centre 

Laki  Goonewardene 
Alberta  Agriculture 
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APPLYING  STATISTICS  TO  PRACTICAL  PROBLEMS 
Milton  Weiss,  Consultant 
Pincher  Creek,  Alberta 


Applied  statistics  in  agriculture  or  biometrics  had  its  start  about  the  end  of  WWI  and  early  1920s 
when  R.A  Fisher,  Bewail  Wright,  and  L.  J.  Lush  first  published  on  the  application  of 
mathematics  and  probability  theory  to  animal  populations.  Various  other  people  at  about  the  same 
time  or  soon  after  began  applying  these  same  principles  to  designed  situations  in  field  trials. 
Yates,  Fairfield  Smith,  Haldane,  Student,  Bartlett,  K.  Pearson,  and  others  laid  a  solid  foundation 
for  further  advances  in  designed  experiments  in  agronomy  as  well  as  quantitative  genetics. 

The  interruption  by  WWII  did  not  slow  this  advance,  it  more  aptly  put  it  on  simmer  since,  when 
hostilities  ended,  a  large  number  of  young  men  and  women  returned  to  imiversities  to  develop 
careers.  These  numbers  were  greatly  enhanced  by  the  aid  of  various  G.I.  bills  which  made  it 
possible  for  many  to  attend  university  who  would  otherwise  have  been  denied.  This  influence 
along  with  greatly  increased  government  support  for  research  and  development  had  an 
immediate  and  profound  influence  on  applied  statistics. 

There  were  many  timely  and  extremely  well-written  texts  published  in  the  15-20  years  following 
WWII.  More  importantly,  a  great  number  of  well-qualified  statisticians  turned  to  teaching  which 
allowed  many  universities  in  North  America  to  establish  statistics  as  a  valid  course  in  either  a 
mathematics  department  or  more  importantly  to  establish  biometrics  or  applied  statistics  as  a  field 
of  instruction  in  agriculture  and/or  biology  faculties. 

This  has  led  to  the  position  we  are  in  today,  where  we  can  classify  our  applied  statisticians  in 
various  ways,  but,  for  simplicity,  we  will  break  them  into  two  groups: 

1.  A  statistician  with  some  background  and  experience  in  the  subject  matter  field  he 
is  serving. 

2.  A  subject  matter  specialist  with  some  interest  and  training  in  applied  statistics, 
particularly  with  those  techniques  or  tools  he  uses  in  his  day-  to-day  work. 

A  third  approach,  and  one  that  I  have  had  a  lot  of  experience  in,  and  enjoyment  and  satisfaction 
from  is  where  a  statistician  and  a  subject  matter  specialist  work  jointly  or  co-operatively  on  a 
project. 
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It  is  highly  desirable  (imperative  ?)  that  the  statistician  be  involved  in  a  project  from  the  start. 
Once  a  project  is  deemed  necessary  or  worth  carrying-out,  the  statistician  should  be  made  a  part  of 
the  project  team  immediately.  This  will  often  ensure  that  the  design  has  the  capability,  if  carried 
out  properly,  of  answering  the  questions  posed  by  the  project  i.e.  that  the  specified  goals  can  be  met. 
The  statistician  should  help  with  the  data  collection  protocol  and  definitely  has  valuable  inputs 
with  respect  to  data  edits  and  validity  checks.  The  analysis  to  be  used  should  be  laid  out  or  specified 
at  this  time,  complete  with  tests  of  hypotheses,  linear  comparisons,  orthogonal  contrasts,  and  so  on. 

Once  the  project  has  progressed  to  the  stage  of  analysis  and  interpretation,  the  statistician  again 
becomes  vital  particularly  in  the  areas  of  interpretation  and  drawing  of  valid  conclusions.  The 
reporting  of  statistical  design,  models  used,  and  analysis  procedures  as  well  as  certain  parts  of  the 
results  section,  are  best  handled  by  the  statistician  or  co-operatively  between  the  subject  matter 
specialist  and  the  statistician.  An  area  which  is  becoming  less  of  a  problem  with  more  people 
trained  in  and/or  appreciative  of  statistics  is  the  role  of  referee  or  arbitrator  between  the  author,  the 
editor,  and/or  the  reviewers  on  statistical  matters.  The  applied  statistician  also  can  provide  a 
useful  role  in  technology  transfer  -  boiling  the  whole  process  down  so  that  it  is  meaningful  to  the 
end  users  —  not  just  to  other  specialists  in  the  subject  matter  field. 

When  I  first  became  interested  in  statistics  in  the  mid  1950s  there  was  one  mechanical  desk 
calculator  for  a  whole  class  to  use  (and  it  usually  didn't  work).  I  soon  became  very  proficient  in 
hand  calculations  with  a  pencil,  paper,  sliderule,  and  a  very  large  eraser.  I  was  fortunate  to  have 
the  opportunity  to  use  the  Mystic  at  Michig£in  State  University  and  the  Cyclone  at  Iowa  State 
University  before  they  became  museum  pieces.  In  1961 1  had  my  initiation  to  IBM  and  the  then 
popular  650,  soon  to  be  replaced  in  our  work  by  the  1620  in  early  1962.  Overlapping  with  this 
experience  were  the  wired  board  (402,  407,  602,  etc.)  machines.  Instead  of  programming  them,  you 
wired  all  the  instructions  one  digit  at  a  time  into  boards  and  then  ran  applications  from  punched 
cards. 

With  the  1620  we  really  started  to  move  forward  -  every  shop  acquired  a  numerical  analyst  and  one 
or  more  programmers.  It  was  the  "in  thing"  to  learn  programming;  machine  language,  symbolic 
languages  and  then  EUREKA!!  we  got  compilers  for  Fortran  and  Cobol.  However,  computing  was 
so  slow  and  expensive  that  you  spent  days  and  even  weeks  desk  checking,  debugging  and 
correcting  machine  language  compiled  card  decks  rather  than  go  on  the  computer  and  chance 
blowing  the  budget.  Every  shop  was  developing  programs  and  exchanging  with  each  other,  often 
with  great  duplication  of  effort.  In  short,  there  was  little  time  for  the  real  job  of  applying  statistics 
to  user  problems. 
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The  next  big  step  was  commercial  statistical  analysis  packages  -  BMD,  SPSS,  and  later  SAS.  This 
was  fantasy  land.  Not  only  could  we  get  the  computer  to  work  for  us  -  the  instructions  involved 
were  simple  and  easy  to  follow  -  too  easy  in  some  cases.  Along  with  these  new  packages  we  were 
getting  better  access  to  bigger,  faster,  and  less  expensive  computers.  What  used  to  take  days  and 
even  weeks  to  accomplish  can  now  be  done  in  minutes,  right  from  your  own  desk. 

Now  we  not  only  have  terminals  but  also  micro  computers  that  for  a  very  modest  price  can  do 
virtually  everything  we  need  right  at  hand.  SAS  and  many  other  statistical  packages  are 
available  (at  a  fairly  modest  cost)  for  micro-computers. 

A  note  of  caution  -  we  can  do  what  we  want  fast  and  efficiently  in  so  far  as  statistics  is  concerned. 
However,  many  times  we  forget  to  apply  our  statistical  knowledge  to  problems.  You  rarely  hear 
mention  of  such  topics  as  fixed,  mixed,  or  random  models,  residual  analysis,  and  so  on.  Yet  these 
are  easily  accommodated,  at  most  requiring  only  a  little  juggling  of  the  degrees  of  freedom  (df) 
and  corresponding  sums  of  squares  (ss)  terms  in  the  ANOVA  table.  The  tests  of  significance  and 
comparisons  may  or  may  not  require  extra  work  beyond  this  first  manipulation  as  many 
packages  allow  the  user  to  specify  the  error  term  to  be  used. 

Personally  I  like  the  present  situation  - 1  can  now  accomplish  in  one  relaxing  evening  what  used 
to  take  six  months  (with  the  help  of  3  clerks  on  calculators)  of  hard  work  to  achieve. 
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Time  Series  Models  in  Econometrics 

Vic  Adamowicz 
Associate  Professor 
Department  of  Rural  Economy 
University  of  Alberta 
Edmonton,  Alberta 

Economists  have  struggled  with  attempts  to  model  the  structure  of  the  economy  for  decades.  They 
have  concentrated  on  so-called  structural  models  of  demand  and  supply,  interest  rates  and  money  supply, 
exchange  rates  and  export  quantities,  and  a  variety  of  other  relations  which  arise  from  economic  theory.  The 
problem  for  the  statistical  modeler,  however,  is  that  these  relationships  cannot  be  isolated  from  all  the  other 
aspects  of  the  economy  that  function  around  them.  We  cannot  stop  the  world  to  examine  the  relationship 
between  exports  of  wheat  and  the  Canada-U.S.  exchange  rate.  A  myriad  of  other  influences,  including  the 
influence  of  time,  are  creating  a  smoke  screen. 

The  challenge  for  economists,  or  more  particularly  econometricians,  is  to  find  a  fan  to  blow  the  smoke 
away,  so  that  we  can  see  the  correct  relationship.  In  most  of  the  physical  sciences,  the  "fan"  is  experimental 
design.  Controls  are  put  in  place  and  unwanted  smoke  is  kept  away  from  the  relationship  in  question.  In  the 
analysis  of  social  systems,  such  designs  are  not  possible.  The  search  for  a  good  "fan"  is  the  attempt  to  resolve 
the  "identification  problem." 

The  identification  problem  is  a  statistical  technicality  that  arises  when  the  parameters  of  the  underlying 
structural  model  cannot  be  uniquely  "identified"  from  the  data.  The  classic  example  is  that  of  supply  and 
demand.  A  supply  curve  traces  out  the  response  of  sellers,  in  terms  of  the  quantity  they  wish  to  sell,  to 
changes  in  the  market  price.  A  demand  curve  traces  out  the  response  of  buyers  to  these  same  market  prices. 
The  data  we  gather  from  a  market,  however,  are  only  the  equilibrium  quantities  and  prices  (the  prices  and 
quantities  agreed  on  by  the  buyers  and  sellers  in  that  period).  An  estimation  of  the  relationship  between  price 
and  quantity  reveals  a  mix  of  supply  and  demand  factors  and  not  a  unique  identification  of  either  set  of 
underlying  structural  parameters.  If  we  have  data  on  a  number  of  periods  of  equilibrium,  we  still  cannot  sort 
out  whether  we  are  watching  demand,  supply  or  some  temporal  influence  on  the  price-quantity  data. 

Economists  have  chosen  a  number  of  approaches  to  modeling  or  forecasting.  Three  forms  of  modeling 
will  be  discussed  in  this  paper;  the  traditional  structural  approach  to  modeling  economic  systems  over  time, 
the  time  series  approach  and  the  vector  autoregressive  approach.  The  three  approaches  are  quite  different  in 
their  statistical  methods  and  their  method.of  addressing  the  identification  problem. 
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Structural  Econometric  Modeling 

Structural  econometric  models  use  time  series  data  to  estimate  the  relationships  between  economic 
variables.  The  statistical  techniques  are  modifications  of  multivariate  analysis.  Using  the  supply-demand 
example,  a  structural  model  would  theoretically  assign  a  particular  form  to  the  demand  relation  and  the  supply 
relation.  For  example,  both  quantities  would  be  a  function  of  price  but  demand  may  also  be  a  function  of 
income  while  supply  may  be  a  function  of  weather  factors.  Typically  there  are  few  explicit  dynamic  elements  in 
the  equations:  quantities  today  are  expressed  as  functions  of  prices  and  other  factors  in  the  current  time 
period. 

The  modeling  process  described  above  requires  a  strong  theoretical  base  to  provide  the  specification  of 
the  equations.  The  resulting  statistical  model  has  the  merit  that  it  is  based  on  theory  and  should  provide  more 
than  just  "correlations".  However,  more  works  needs  to  be  done  to  identify  the  parameters  of  each  function. 
Additional  restrictions  must  be  placed  on  the  model  to  be  able  to  isolate  the  demand  function  from  the  supply 
function. 

In  order  to  "identify"  the  demand  function  from  the  supply  function,  a  very  particular  set  of  restrictions 
is  used.  The  restrictions  must  produce  a  model  where  there  are  enough  exogenous  variables  (variables  not 
determined  within  the  system)  in  each  equation  to  allow  identification.  For  example,  weather  is  an  exogenous 
factor.  If  weather  only  affects  the  supply  curve,  these  changes  in  the  supply  curve  will  allow  us  to  trace  out  all 
the  equilibrium  prices  and  quantities  that  occur  along  a  single  demand  curve.  The  exogenous  variable  in  the 
supply  curve  identifies  the  demand  curve.  Similarly,  there  must  be  exogenous  variables  in  the  demand  curve 
which  help  identify  the  supply  curve.  In  more  complex  models  the  solution  to  this  search  for  exogenous 
variables  can  become  quite  difficult.  The  implication  for  modeling  is  that  if  one  wants  to  be  able  to  identify 
the  underlying  structural  equations  in  their  model,  they  must  exclude  some  factors  from  the  equations.  These 
restrictions,  while  solving  the  identification  problem,  lead  to  questions  about  appropriate  specification  of  the 
models.  The  estimation  of  these  models,  even  after  the  identification  problems  has  been  solved,  involves  one 
of  many  variants  of  multivariate  regressions  analysis.  These  estimation  techniques  include  Two  Stage  Least 
Squares,  Three  Stage  Least  Squares  and  a  variety  of  maximum  likelihood  based  approaches  (see  Judge,  et  al, 
1988). 

The  structural  econometric  approach  to  time  series  analysis  has  been  criticized  for  a  number  of  reasons. 
First,  the  models  tend  not  to  forecast  very  well.  A  variety  of  reasons  have  been  suggested  for  the  poor 
forecasting  performance  including  the  fact  that  little  dynamic  influence  is  included  in  these  models. 
Nevertheless,  structural  modelers  have  maintained  that  they  are  attempting  to  adhere  to  theory  and  they 
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suggest  that  empirical  analysis  without  this  formal  theory  is  vacuous.  A  second  criticism  of  structural  models 
is  that  they  tend  to  become  rather  large  and  expensive  to  run.  An  example  of  a  large  structural  model  is 
Agriculture  Canada's  FARM  model  which  contains  over  1,000  equations  in  a  number  of  sectors. 

One  of  the  most  scathing  critiques  of  structural  models  came  from  Robert  Lucas  (1976).  The  so  called 
"Lucas  Critique"  is  based  on  the  notion  that  the  correct  theoretical  model  depends,  at  least  in  part,  on  the 
current  policy  scenario.  A  change  in  the  policy  situation  requires  a  reformulation  of  the  parameters  of  the 
theoretical  model.  In  forecasting  the  impact  of  a  policy  change,  however,  structural  modelers  leave  the 
coefficients  intact  and  adjust  the  exogenous  variables.  Lucas  argued  that  these  structural  models  will 
undoubtedly  provide  poor  forecasts  of  policy  shocks.  Another  criticism  related  to  the  coefficients  of 
structural  models  is  attributed  to  Sims  (1980).  Sims  argues  that  in  attempting  to  identify  traditional  models, 
overly  restrictive  assumptions  are  likely  to  be  used,  resulting  in  poor  models. 

The  criticisms  of  structural  models  led  to  a  wave  of  other  models  which  were  designed  for  forecasting 
purposes.  These  models,  which  focused  on  the  time  series  elements  in  economic  data,  provide  a  contrast  to 
the  traditional  models,  not  only  by  their  emphasis  on  the  temporal  dimension,  but  also  on  their  lack  of  explicit 
theoretical  base. 
Time  Series  Modeling 

Simple  time  series  models  concentrate  on  explaining  the  data  as  a  stochastic  process.  The  emphasis  is 
on  the  temporal  structure  of  the  data  series.  The  data  are  first  examined  for  their  structure  over  time  or 
degree  of  stationarity.  Most  simple  time  series  techniques  assume  that  the  data  are  covariance  stationary 
stochastic  processes  (see  Judge,  et  al,  1988).  Given  covariance  stationarity,  the  series  can  be  modeled  as  an 
"autoregressive  process"  or  a  "moving  average"  process.  An  example  of  the  former  is  a  first  order 
autoregressive  process  (ARl)  or 

[1]  Q,  =  e(2,-,*e, 

A  first  order  moving  average  process  (MAI)  can  be  written  as 

[2]  Q,    =  £,-<!>£,-,. 
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These  simple  time  series  representations  focus  on  past  value  of  the  series  itself  or  the  error  process  for 
explanations  of  the  present  observation.  A  more  general  time  series  representation  can  be  specified  as 
[3]  0(Z)Q,  =  4>(i:)e, 

where  9  ( L  )and  4»(  /:  )are  polynomials  in  the  lag  operator.  1 

Once  the  structure  of  the  stochastic  process  is  determined,  a  model  can  be  estimated  and  the  process 
can  be  used  for  forecasting.  A  common  approach  used  to  determine  the  structure  of  time  series  models  is  the 
Box-Jenkins  approach.  The  Box- Jenkins  approach  involves  three  steps,  (1)  Identification,  (2)  Estimation  and 
(3)  Diagnostic  checking  (see  Judge,  et  al,  1988,  chapter  16).  Identification  includes  determining  if  the  process 
is  covariance  stationary  and  using  plots  of  the  autocorrelation  and  partial  autocorrelation  function  to 
determine  the  order  of  the  process.  The  process  may  be  autoregressive,  moving  average  or  it  may  be  a  mixture 
of  the  two.  If  a  model  is  not  stationary,  differencing  is  usually  used  to  remove  the  trend  from  the  data  and 
reduce  it  to  stationarity.  This  differencing  process  is  also  called  integration.  Thus  a  common  name  for  these 
models,  ARIMA,  stems  from  the  Autoregressive,  Integrated,  and/or  Moving  Average  components  of  the  data. 

Estimation  of  these  models  can  proceed  in  a  number  of  ways.  The  autoregressive  models  can  be 
estimated  using  OLS  but  a  number  of  other  methods  are  typically  used  to  increase  efficiency.  The  moving 
average  models  must  be  estimated  using  nonlinear  least  squares  as  the  use  of  lagged  values  of  the  error  term 
in  the  model  requires  iteration  around  an  initial  condition.  There  are  two  types  of  diagnostics  used  to  check 
these  models.  First,  the  coefficients  are  examined  to  ensure  that  they  do  not  produce  explosive  processes.  For 
example,  if  a  first  order  autoregressive  model  has  a  coefficient  that  exceeds  unity,  the  process  is  not  stationary. 
Second,  the  errors  of  the  models  are  checked  to  determine  if  they  correspond  to  white  noise  or  a  gaussian 
mean  zero,  identically  distributed,  independent  process. 

Additional  issues  in  ARIMA  modeling  include  modification  of  the  process  for  seasonality  effects  and 
the  use  of  transfer  functions.  Seasonality  is  modeled  by  including  lags  corresponding  to  the  seasonal  pattern 
in  the  data.  Transfer  functions  are  somewhat  similar  to  dummy  variables  in  regression  analysis.  They  indicate 
a  break  point  in  the  data  and  a  change  in  coefficient  values. 

The  emphasis  in  ARIMA  models  is  the  time  series  structure  of  the  data.  Cause  and  effect  relationships 
are  ignored  in  favour  of  the  temporal  dimension.  This  emphasis  on  data  analysis  leaves  most  researchers  who 
believe  in  theory  first  and  models  second  somewhat  disappointed.  One  can  forecast  with  ARIMA  models  but 


I  The  symbol  L  is  the  lag  operator,  ie.  LYt  =  Yt-i-  The  expression  e( I )X, describes 

9  0  A'  (  +  9  I  A' ,  - 1  +  9  2    ( -  2  •  •  • 
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the  effect  of  a  policy  change,  unless  it  is  modeled  with  a  transfer  function,  cannot  be  simulated.  However,  even 
though  ARIMA  models  do  not  rely  on  theoretical  specifications,  they  tend  to  forecast  very  well,  especially  in 
the  short  term.  Numerous  studies  of  forecasting  performance  find  ARIMA  models  to  be  at  least  as  good  as 
simple  econometric  models  and  they  are  usually  simpler  to  build  and  less  expensive  to  estimate.  The  lack  of 
theoretical  base,  however,  leaves  these  models  open  to  attack.  The  next  set  of  models,  VAR  models, 
incorporate  the  time  series  elements  of  ARIMA  models  and  the  structural  econometric  elements  of 
traditional  models. 
Vector  Autoregression  Models^ 

A  Vector  Autoregression  Model  (VAR)  is  essentially  a  dynamic  simultaneous  equation  system.  The 
dependent  variables  are,  by  definition,  all  endogenous  variables  and  the  independent  variables  are  lagged 
observations  of  all  variables  in  the  system^.  Each  equation  contains  the  time  series  structure  of  an  ARIMA 
model  with  all  variables  interacting  in  the  system.  A  VAR  model  imposes  very  few  a  priori  restrictions  on  the 
parameters  in  the  simultaneous  equation  system.  This  allows  the  data  to  provide  a  representation  of  the 
changes  in  the  system  without  the  "zero  restrictions"  required  in  traditional  simultaneous  equation  techniques. 
One  should  note,  however,  that  while  a  VAR  model  does  not  impose  zero  restrictions  on  the  parameters  in 
the  traditional  simultaneous  equation  fashion,  the  model  does  require  identification  restrictions  to  provide 
information  on  the  response  of  system  variables  to  shocks.  The  nature  of  these  restrictions  will  be  outlined 
below. 

The  VAR  approach  uses  the  set  of  lags  of  all  of  the  endogenous  variables  in  each  behavioral  equation 
as  the  reduced  form  or  statistical  model.  The  economic  structure  is  identified  using  the  variance  matrix  of  the 
residuals  to  place  identifying  restrictions  on  the  matrix  of  contemporaneous  coefficients.  In  VAR  models,  the 
statistical  model  is  developed  first  and  then  the  structural  model  is  identified.  This  is  opposite  to  the 
approach  followed  in  traditional  econometrics  and  is  favoured  by  some  statistical  theorists  (Spanos,  1989). 

While  both  the  VAR  approach  and  traditional  econometric  approaches  require  identification 
restrictions,  the  nature  of  these  restrictions  are  quite  different.  The  traditional  approach  uses  zero  restrictions 
on  parameters  for  identification  while  the  VAR  approach  uses  the  covariance  matrix  of  the  reduced  form 
residuals  and  the  assumption  of  orthogonal  behavioral  shocks  to  establish  identification.  Both  approaches 
may  be  used  to  study  responses  to  policy  shocks  (see  Mount,  1989;  Todd,  1989).  The  traditional  approaches 


2  This  section  is  based  on  the  discussion  in  Jennings,  et  al,  1991. 

3  Where  they  are  considered  to  be  important,  exogenous  (or  deterministic)  variables  may  be  included  in  the 
set  of  independent  variables  in  the  system. 
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tend  to  place  little  emphasis  on  lags  in  equations  while  the  VAR  approach  emphasizes  it.  The  traditional 
approach  places  strict  interpretations  on  the  parameters  of  each  equation  while  the  VAR  approach  interprets 
the  system  as  a  whole  and  the  analyzes  responses  to  the  behavioral  shocks  (see  Orden  and  Fackler,  1989). 
The  VAR  approach  begins  with  a  dynamic  equation  system  of  the  form 

[4]  yo-s)=  Xk^-o 

s-0  s-0 

where  Y(t)  and  v(t)  are  k  x  1  vectors  and  A(s)  is  a  k  x  k  matrix  of  coefficients  for  each  time  period  (s)  previous 
to  current  time  (t).  The  model  in  (4)  relates  the  observable  data  (Y)  to  sources  of  variation  in  the  economy 
(v).  The  shocks  in  v(t)  are  assumed  to  "represent  behaviorally  distinct  sources  of  variation  that  drive  the 
economy  over  time"  (Orden  and  Fackler,  1989,  p  496).  The  vector  v(t)  has  an  expected  value  of  zero  and  an 
assumed  diagonal  covariance  matrix,  D.  The  covariance  matrix  is  assumed  to  be  diagonal  so  that  individual 
shocks  (v(t))  apply  to  only  one  behavioral  equation  at  a  time.  Thus  we  can  evaluate  the  effect  of  shocks  to 
each  behavioral  equation  on  each  variable  in  the  system. 

Assuming  that  errors  from  previous  lags  do  not  affect  the  current  values,  equation  (1)  can  be  rewritten 
in  auto  regressive  form  as 

[5]  /i(0)y(o  =  -I/i(5)y(/-s)^KO 

s  -  1 

The  matrix  A(0)  is  the  set  of  contemporaneous  parameters  on  Y(t).  Multiplying  through  by  A(0)  inverse 
yields 

[6]  y(0=  t  Dis)Y(t-s)^u(t) 

s-  1 

vvhere  D(s)  =  -A(0)-1a(s)  and  u(t)=A(0)"lv(t).  The  vector  u(t)  is  the  one  step  ahead  prediction  error  in  Y(t) 
and  the  covariance  matrix  of  u(t)  is  I .  Equation  (6)  is  the  autoregressive  equation  which  is  estimated  given  an 
assumption  on  the  lag  length.  It  is  the  reduced  form  model. 

In  attempting  to  identify  the  effect  of  a  shock  to  a  behavioral  equation  on  the  variables  in  the  system  we 
can  use  the  coefficients  estimated  in  (6)  and  the  observed  error  to  simulate  the  impact.  Since  all  the  variables 
are  related  in  the  system  it  is  not  possible  to  "untangle"  the  effects  of  one  variable  on  another  using  the 
autoregressive  representation.  However,  the  autoregressive  representation  can  be  used  to  find  the  moving 
average  representation  which  expresses  the  level  of  a  particular  variable  as  a  function  of  the  error  process. 
From  the  moving  average  process  the  impact  of  the  behavioral  shocks  in  each  equation  on  each  other  variable 
can  be  identified. 
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The  moving  average  representation  of  (6)  can  be  written  as 


[7]  y(0=  Ic(s)/i(0)-'Ko 

s-O 

This  is  the  Impulse  Response  Function  (IRF)  which  describes  the  effect  of  shocks  to  the  behavioral  relations 
on  variables  in  the  system"*.  The  matrix  A(0)  contains  the  information  required  for  the  identification  of  the 
model.  Restrictions  on  A(0)  are  analgous  to  restrictions  on  coefficients  in  structural  modeling  (see  Orden 
and  Fackler  for  a  description  of  the  identification  of  VAR  models).  The  IRF  summarizes  the  dynamic 
multipliers  as  implied  by  our  identification.  A  shock  may  be  represented  by  the  placement  of  the  value  unity 
in  one  element  of  the  vector  v(t).  The  IRF  provides  the  response  of  all  variables  in  the  system  to  this  unit 
shock.  The  interpretation  of  these  shocks  is  analgous  to  the  interpretation  of  coefficients  in  a  structural 
model. 

Finally,  much  of  the  appeal  of  VAR  modeling  lies  in  the  fact  that  restrictions  on  the  parameters  of 
reduced  form  do  not  need  to  be  specified  a  priori.  Often,  however,  unrestricted  VAR  models  suffer  from 
overparameterization,  resulting  in  estimates  which  reflect  purely  random  fluctuations  in  the  data  and  not  the 
systematic  variation  which  we  are  interested  in  identifying.  Consequently,  estimated  variances  will  be  too 
large  and  will  produce  models  with  poor  forecasting  performance.  One  way  to  handle  this  problem  is  to  use 
Bayesian  prior  estimation  in  which  stochastic  restrictions,  in  the  form  of  prior  distribution  weights,  are 
applied  to  VAR  parameters  (see  Sims,  1986).  Bayesian  techniques  are  used  to  assign  weights  to  certain  lags  in 
the  system.  For  example,  financial  data  often  exhibit  random  walks,  or  coefficients  near  unity  on  the  first  lag 
and  zeros  elsewhere.  A  Bayesian  scheme  would  impose  a  prior  mean  of  unity  on  all  first  lags  of  the  dependent 
variable  and  a  prior  mean  of  zero  on  all  other  lags.  Mixed  estimation  is  used  to  impose  this  prior  information 
(Litierman,  1986). 

The  VAR  approach  incorporates  the  theoretical  elements  of  traditional  econometrics  and  the  lime 
series  elements  of  ARIMA  modeling.  Other  advantages  of  the  VAR  include  the  fact  that  model  development 
and  estimation  is  relatively  inexpensive  and  the  forecasting  accuracy  is  quite  good.  Litterman  (1986) 
compares  forecasts  from  several  large  econometric  models,  simple  time  series  models  and  a  Bayesian  VAR 


4  See  Judge  et  al.,  1988,  p.  771-775  for  an  illustration  of  the  derivation  and  use  of  impulse  response  functions 
or  innovation  accounting. 
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model.  The  results  of  a  forecasting  experiment  with  these  models  is  presented  in  Table  1.  The  Bayesian  VAR 
outperforms  the  other  models  in  Real  GNP  Growth  and  Inflation  prediction  for  most  periods  (as  measured 
using  mean  squared  error).  The  VAR  model  is  also  respectable  in  forecasts  of  Unemployment. 

The  forecasting  results  of  VAR  models  and  the  fact  that  policy  analysis  can  be  performed  with  them 
makes  these  tools  a  viable  alternative  to  structural  econometric  models  and  simple  time  series  models.  While 
a  variety  of  other  multivariate  time  series  models  exist  (Aoki,  1990)  the  VAR  approach  is  relatively  simple  to 
estimate,  relying  primarily  on  OLS  regression  results,  and  relatively  inexpensive  to  forecast  with.  The 
application  of  Bayesian  priors  to  VAR  analysis  allows  for  individual  researchers  to  factor  in  their  own  beliefs 
in  a  systematic  manner.  Such  beliefs  have  typically  been  imposed  in  an  ad  hoc  manner  in  structural  models. 
Conclusions 

This  paper  has  described  three  approaches  to  time  series  modeling  in  econometrics.  Traditional 
multivariate  structural  modeling,  ARIMA  modeling  and  VAR  modeling  have  been  outlined  as  competing 
•  chniques.  The  paper  has  concentrated  on  econometric  examples  but  these  models  are  applicable  to  a  wide 
range  of  topics,  including  those  in  the  physical  sciences.  This  discussion  has  also  been  centered  around  the 
ability  of  these  models  to  predict  well  and  to  fit  with  existing  theory.  The  identification  problem,  or  the 
"smoke"  caused  by  non-experimental  data,  requires  careful  theoretical  modeling  as  well  as  proper  statistical 
analysis.  Traditional  econometrics  has  used  theory  to  provide  the  functional  relationships  and  then  attempted 
to  conquer  the  statistical  modeling  problem.  Time  series  models  place  more  emphasis  on  the  statistical 
process  with  little  explicit  recognition  of  economic  theory.  VAR  approaches  try  to  combine  theoretical 
principles  with  a  time  series  statistical  component.  However,  they  address  the  identification  problem  after  the 
estimation  of  a  statistical  reduced  form  model.  While  the  VAR  results  seem  promising  one  must  emphasize 
that  VARs  are  really  an  alternative  rather  than  a  successor  to  structural  modeling.  The  VAR  technique  is 
another  kind  of  fan  used  to  blow  away  the  statistical  "smoke". 
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Table  1.  Mean  Squared  Error  Forecasts  1976:1  -  1979:4 


Forecast  Horizon:  Quarters  Ahead 


Variable 

9 

-I 

J 

o 

Real  GNP  Growth 

Z.  /ZO 

Z.oUl 

9  Q^l 

AivlJYLA. 

071 
3.U  / 1 

n7A 

1  fil 

J. Vol. 

Univariate  AR 

j.ojo 

DdycMdn  v/\xv 

Z.o*tl 

Z.i'^fo 

.j.UZl 

Inflation 

DRI§ 

1.605 

1.929 

2.277 

2.894 

2.727 

ARTMA 

1  007 

1  7SS 

9  911 

9  "^97 

vj 111 v<ii Idle  /vr\ 

'^111 

D.OiSJ 

'\  040 

Ravfcisin  \/Al? 
Dciy^oiciii  V  /^jx 

1  441 

1  710 

i.  /  lU 

1  640 

Unemployment 

DRI§ 

.341 

.449 

.485 

.494 

.430 

ARIMA 

.466 

.712 

.915 

1.073 

1.236 

Univariate  AR 

.362 

.493 

.541 

.566 

.576 

Bayesian  VAR 

.383 

.497 

.559 

.621 

.738 

§  DRI  structural  econometric  models  forecasts. 
Source:  Litterman,  1986 
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ABSTRACT 

This  paper  contains  two  parts.  The  first  part  summarizes  the  method- 
ology for  fitting  loglinear  models  to  three-dimensional  contingency  tables 
using  the  method  of  maximum  likelihood.  An  example  based  on  traffic 
accident  data  is  used  to  illustrate  the  techniques.  The  second  part  of  the 
paper  outUnes  the  logistic  regression  model  for  a  dichotomous  dependent 
variable.  An  example  based  on  female  labor  force  paxticipation  data  is  used 
to  demonstrate  the  methodology. 


OUTLINE 

1.  LogUnear  Models  for  Three-Dimensional  Contingency  Tables 

2.  Logistic  Regression 
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1.  Loglinear  Models  for  Three-Dimensional  Contingency  Ta- 
bles 


The  three-dimensionad  contingency  table  arises  from  the  cross-classification 
of  the  categories  associated  with  three  qualitative  rsindom  variables.  Ge- 
ometrically the  table  may  be  viewed  as  having  rows,  columns  and  layers. 
The  subscripts  for  the  rows,  colunms  and  layers  will  be  denoted  by  i,  j  and 
k  respectively.  The  number  of  rows,  columns  and  layers  will  be  denoted 
by  r,  c  and  i  respectively.  The  probability  density  for  cell  {ij,k)  will  be 
denoted  by  fijk  and  the  theoretic^  ceU  frequency  by  Fijk  =  n/.jjk  for  a 
total  table  frequency  of  n.  The  zJlocation  of  a  sample  of  size  n  to  the  totzd 
of  rc£  cells  yields  cell  frequencies  riijk.  Table  1  shows  the  riijk  for  a  sample 
of  size  n. 

Various  marginal  totals  will  be  denoted  using  dots  to  indicate  which  sub- 
scripts have  been  sununed.  For  the  three  possible  two-dimensional  tables, 
the  cell  frequencies  aie  denoted  by  the  marginals  riij,,  n^.i  and  n.jfc.  For 
each  of  the  three  variables  the  univciriate  marginals  are  given  by  n,-., ,  n.j. 
and  n„k- 

Table  1 .  A  Three  Dimensional  Contingency  Table 

Columns 


Layers 

Rows 

1 

2 

C 

1 

1 

"121 

••  "Icl 

2 

"221 

. .  "2cl 

r 

"r21 

•  •  "rd 

2 

1 

"122 

.  .  "ic2 

2 

"212 

"222 

..  n2c2 

r 

"rl2 

"r22 

•  •  "rc2 

i 

1 

"11/ 

"12/ 

••  "Ic/ 

2 

"21/ 

"22/ 

..  "2c/ 

r 

"rl/ 

"r2/ 

•  •  "rc/ 
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Table  2.  Frequency  Table  - 
Condition 


Driver  Injviry  Level  vs.  Seatbelt  Usage  and  Driver 


Driver  Injury  Level 
Driver        Seatbelt  Major/ 


Condition  Usage 

None 

Minimal 

Minor 

Fatal 

Totals 

XT  1  -ir  

INonnal  Yes 

12500 
(11817.8) 

604 
(697.1) 

344 
(450.2) 

38 
(51.8) 

13486 

No 

61971 
(62161.0) 

3519 
(3666.9) 

2272 
(2368.0) 

237 
(272.2) 

67999 

Totals 

74471 

4123 

2616 

275 

81485 

Been  Yes 
Drinking 

313 
(766.3) 

43 
(45.2) 

15 
(29.2) 

4 

(3.4) 

375 

No 

3992 
(4030.9) 

481 
(283.0) 

370 
(153.6) 

66 
(17.7) 

4909 

Totak 

4305 

524 

385 

7Q 

5284 

Totals  Both  Conditions 

78776 

4647 

3001 

345 

86769 

Example 

An  example  of  a  three-dimensional  table  is  presented  in  Table  2.  The  three- 
way  table  shows  the  relationships  between  extent  of  injury,  seatbelt  usage 
and  driver  condition  for  a  Scimple  of  86,769  auto  accidents. 

Models  for  Three-  Way  Tables 

We  begin  here  with  the  independence  model.  The  independence  model 
requires  that  the  joint  density  fijk  in  cell  (i,  j,  k)  be  equcil  to  the  product  of 
the  three  univariate  marginal  densities  fijk  =  fi..f.j.f..k'  The  theoretical 
frequency  for  a  total  frequency  of  n  is  given  by 

Fijk  =  nfijk  =  Fi.,F.j.F..k/n^ 
where  F,-..  =  n fi„ ,  F,j,  =  n f,j,  and  F„k  =  n f,.k . 
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Inference  for  the  Independence  Model 


Given  a  sample  of  size  n,  the  maximum  likelihood  estimators  of  the  ex- 
pected cell  frequencies  imder  the  independence  assumption  are  given  by 

Eijk  =  ni..n.j.n^k/n^    i  =  1, 2, . . . ,  r; 

i  =  l,2,...,c; 

The  fitted  cell  frequencies  depend  only  on  the  row,  column  and  layer 
marginals.  Using  the  estimated  expected  frequencies  Eijk,  ^^sts 
of  goodness  of  fit  for  the  independence  model  are  carried  out  using  either 
of  two  statistics 

C       /      /  r,  \2 
^2  _  ^  ^  ^  K^ijk  -  ^ijk) 

or 


t=i  j=i  k=i 

both  of  which  are  asymptotically  with  {rc£  —  r  —  £  —  c-{-2)  degrees 
of  freedom  if  the  independence  hypothesis  holds.  The  statistic  is  the 
Pearson      statistic,  while      is  derived  from  a  Ukelihood  ratio  test. 


Example 

The  x^  test  of  independence  for  Table  2  yields  1057.47  and  939.90  for  the 
Pearson  and  likelihood  ratio  statistics  respectively.  Both  of  these  statis- 
tics have  10  degrees  of  freedom  and  are  significant  at  the  0.000  level.  The 
expected  frequencies  under  the  independence  model  Eire  shown  in  Table  2  in 
brackets.  A  comparison  of  the  observed  and  expected  frequencies  permits 
us  to  conclude  the  following: 

(a)  For  seatbelt  users  who  appeared  normaJ,  the  number  of  accidents  re- 
sulting in  no  injury  was  larger  than  expected,  while  the  number  who 
sustained  any  injury  was  smaller  than  expected  under  independence. 

(b)  For  seatbelt  users  who  had  been  drinking,  the  number  of  accidents 
resulting  in  no  injury  was  less  than  half  the  number  expected  under 
independence.  In  the  minor  injury  category  there  were  fewer  cases 
than  expected. 
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(c)  For  non-users  of  seatbelts  who  appeared  normal,  the  number  of  acci- 
dents in  all  injury  categories  was  less  than  expected  under  indepen- 
dence. 

(d)  For  non-users  of  seatbelts  who  had  been  drinking,  the  number  of  ac- 
cidents resulting  in  no  injury  was  less  than  expected.  For  the  three 
injury  categories,  the  number  of  accidents  was  much  larger  than  ex- 
pected under  independence. 

Drivers  who  wore  seatbelts  and  appeared  normal  sustained  fewer  in- 
juries than  expected,  while  drivers  who  did  not  wear  seatbelts  and  had 
been  drinking  suffered  more  injuries  than  expected  under  independence. 
For  the  remaining  two  categories,  the  difference  between  the  observed  and 
expected  frequencies  seems  less  obvious.  A  logUnear  model  representation 
for  this  table  will  be  used  below  to  identify  the  interactions  among  the 
three  variables.  Before  attempting  to  model  the  variation  in  the  table,  a 
discussion  of  various  model  types  is  required. 

For  the  remainder  of  this  discussion  the  samphng  model  assumed  is  either 
multinomial  or  independent  Poisson.  The  two  distributions  eire  equivalent 
if  the  sample  size  n  is  fixed.  Because  the  product  multinomial  places  ad- 
ditional restrictions  on  some  marginab,  additional  requirements  must  be 
adhered  to  in  order  to  obtain  maximum  hkelihood  estimates. 

Partial  Independence 

Since  there  are  three  variables  in  the  table,  it  is  possible  to  have  two  waii- 
ables  related  to  each  other  but  both  be  independent  of  a  third  vairiable. 
This  model  is  called  the  partial  independence  model  and  is  given  by 

fijk  =  {fij.)iLk). 

In  this  case,  the  third  variable  with  subscript  Jk  is  independent  of  the  re- 
maining two  variables  with  subscripts  i  and  j.  The  theoretical  frequency 
is  given  by 

Fijk  =  Fij.F.,k/n 

and  is  estimated  by 

Eijk  =  nij,n„k/n. 

The  two-dimensional  marginals  n,j.  are  being  fitted  since  Eij,  =  n,j..  The 
goodness  of  fit  statistic  in  this  case  has  (rc—  1)  degrees  of  freedom. 
An  exaimple  of  a  partial  independence  relationship  would  exist  if  in  Ta- 
ble 2  seatbelt  usage  were  independent  of  both  driver  condition  and  driver 
injury  level,  but  at  the  same  time  driver  condition  and  injury  level  were 
related. 
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Conditional  Independence 

A  conditional  independence  model  permits  two  variables  to  be  indepen- 
dent cifter  controlling  for  a  third  variable.  An  example  of  such  a  model  is 
provided  by 

fijk  =  fi.kf.jk/f..k 

where  the  variables  with  subscripts  i  and  j  sure  independent  at  every  level 
of  the  variable  with  subscript  k.  The  theoretical  frequency  is  given  by 

Fijk  =  Fi.kF.jk/F„k 
and  the  maximum  likelihood  estimator  is  given  by 

For  this  model  the  two-dimensional  marginals  n,-.*  and  n.jk  are  being  fitted 
since  Ei,k  =  nj.jk  and  E,jk  =  n.jk-  The  goodness  of  fit  statistic  has 
£{r  —  l)(c  —  1)  degrees  of  freedom. 

An  example  of  a  conditional  independence  model  in  Table  2  would  occur 
if,  for  each  of  the  two  driver  conditions,  driver  injury  level  is  independent  of 
seatbelt  usage.  In  this  case  driver  injury  level  is  related  to  seatbelt  usage, 
but  if  driver  condition  is  held  fixed  then  seatbelt  usage  and  driver  injury 
level  are  independent.  In  other  words,  any  relationship  between  driver 
injury  level  and  seatbelt  usage  is  due  to  the  relation  between  driver  condi- 
tion and  the  other  two  variables.  This  result  is  similar  to  obtaining  a  zero 
first-order  partial  correlation  coefficient  with  three  quantitative  variables. 

No  Three-Way  Interaction 

The  next  step  in  moving  to  less  restrictive  models  is  to  assume  that  each 
pair  of  variables  is  related,  but  that  the  relation  between  any  pair  of  vari- 
ables does  not  depend  on  the  level  of  the  third.  This  model  is  usually 
referred  to  as  the  no  three-way  interaction  model.  It  is  not  possible  to 
given  cin  expression  for  fijk  or  for  Fijk  that  would  permit  us  to  determine 
the  estimators  Eijk  directly.  For  this  model  the  Eijk  are  obtained  by  a 
procedure  known  as  iterative  proportional  fitting. 

Since  the  model  to  be  fitted  assumes  that  all  possible  pairs  are  related 
but  that  there  is  no  three-way  interaction,  we  need  only  fit  a  model  which 
preserves  the  three  two-dimensional  marginal  totals  n,j.,  n,jk  and  n,-.jk. 
The  steps  for  iterative  proportional  fitting  proceed  as  follows: 

STEP  1:  Compute  the  observed  marginal  totals  njj,,  n,jk,  n,-.*. 

STEP  2:  Assign  the  initial  value  1  to  every  estimated  cell  frequency  i.e., 
i;[°^  =  l,foraUi,i,ik 
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STEP  3:  Compute  new  estimates  of  the  Eijk  so  that  they  sum  to  the 
marginal  totals  n,j.  using 

STEP  4:  Compute  new  estimates  of  the  Eijk  so  that  they  sum  to  the 
marginal  totals  ni,k  using 

STEP  5:  Compute  new  estimates  of  the  Eijk  so  that  they  sum  to  the 
marginal  totals  n,jk  using 

E^l  =  E^im\    for  all  U,.. 

STEP  6  and  subsequent  steps  —  repeat  the  cycle  given  by  Steps  3,  4 
and  5  until  the  changes  in  the  Eijk  are  smaller  than  some  preassigned 
number. 

For  the  fitted  model  the  three  two-dimensional  marginals  Eij.,  E,jk  and 
Ei.k  wiU  be  very  close  to  their  observed  counterparts  n,j.,  n,jk  and  n,-.jb. 
The  number  of  degrees  of  freedom  for  a  goodness  of  fit  test  would  in 
this  case  be  (r  -  l)(ib  -  l)(c  -  1). 

A  no  three-way  interaction  model  implies  that  the  interaction  between 
any  pair  does  not  depend  on  the  third  variable.  For  the  data  in  Table  2  a 
no  three-way  interaction  model  would  imply  that  the  interaction  between 
seatbelt  usage  and  driver  injury  level  does  not  depend  on  driver  condition. 
Similarly  the  interaction  between  driver  injury  level  and  driver  condition 
does  not  depend  on  seatbelt  usage,  and  the  interaction  between  seatbelt 
usage  and  driver  condition  does  not  depend  on  driver  injury  level. 

Saturated  Model 

As  in  the  case  of  the  two-way  contingency  table,  the  most  general  model  for 
the  three-way  contingency  table  is  the  saturated  model  that  fits  the  data 
perfectly.  The  saturated  model  for  the  three-way  table  includes  a  three-way 
interaction  which  allows  the  two-way  interaction  between  any  pair  to  vary 
at  each  level  of  the  third  variable.  This  model  will  be  discussed  further 
with  the  introduction  of  the  loghnear  model  for  three-way  tables  below. 
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Loglinear  Models  for  Three-way  Tables 

The  saturated  model  for  the  three-way  table  is  given  by 

In  Fiji  =  /i  +  +  fi2(j)  +  /i3(Jb)  +  /^12(ij)  +  Ml3(tJk)  +  f^23(Jk)  +  /il23(»j i). 

1  =  1, 2,. ..,r;    i  =  1.2,...,c;    A:  =  l,2,...,i 
where  Fijk  =  true  frequency  in  cell        A:)  and 

»=i  j=i  fc=i 

C  / 

i=i  ik=l 

r  < 

»=i  k=i 

r  c 

M3(jb)  =  51  -  ^' 

1=1  j=i 

/ii2(.j)  =  l/^^lnF.jjfc  -        -  /i20)  -  /i, 

c 

/il3(ifc)  =  l/c^lnFjjfc  -  -  /i3(ib)  -  /i, 

i=i 

r 

/*230*)  =  1/r^ln  Fijk  -  M20)  -  P3(fc)  -  /i, 
t=i 

/*123(ij*)  =  In  Fijk  -  -  H2U)  -  t^3{k)  -  Pl2(ii), 

-  /*230*)  -  /^13(it)  -  /i. 
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The  following  conditions  follow  from  these  definitions 

r  c  / 

11/^1(0  =  I]/^2(i)  =  ]C^3(t)  =  0, 
»  =  1  i  =  l  k=l 

r       c  r       I  c  I 

1=1  j  =  l  t=l  k=l  j  =  l  jk=l 

r       c  ^ 

1=1  j  =  l  Jk=l 

The  /i  parzimeters  are  functions  of  various  marginal  totals  in  the  table 
of  logcirithms  of  the  theoretical  frequencies,  lnF,jjb.  The  /i  parzuneters  are 
functions  of  the  logarithms  of  various  geometric  means  of  the  frequencies. 
The  expressions  for  the  /x  parameters  may  also  be  written  as  /i  =  In  F,„ , 


InF..., 

=  lnF.j.- 

InF..., 

/i3(i) 

=  lnF..*- 

InF..., 

/'I2(»i) 

=  lnF,;.- 

InFi.. - 

InF.j. +  lnF..., 

=  hiF.-.,- 

-in^;-..  - 

lnF..jt  +  lnF..., 

/^23(ii) 

=  lnF.,*- 

-b^.j. - 

-  ln^..jb  +  ln^..., 

A'l23(0t) 

=  hi^;ji- 

+  hiF,-.. 

+  In  F.J. 

+  bF..t-lnF... 

where 

i^...  is  the  overall  geometric  mean  of  all  the  frequencies  Fijk\ 
Fi„  is  the  geometric  mean  of  all  the  frequencies  Fijk  holding  i  fixed; 
F.J.  is  the  geometric  mean  of  all  the  frequencies  Fijk  holding  j  fixed; 
F.,k  is  the  geometric  mean  of  all  the  frequencies  Fijk  holding  k  fixed; 
Fij,  is  the  geometric  mean  of  all  the  frequencies  Fijk  holding  i,j  fixed; 
F,jk  is  the  geometric  mean  of  all  the  frequencies  Fijk  holding  j.k  fixed; 
Fi,k  is  the  geometric  mean  of  all  the  frequencies  Fijk  holding  i,  k  fixed. 

For  eax:h  of  the  models  introduced  above  for  three-way  tables  the  cell 
frequencies  Fijk  have  different  properties.  These  properties  imply  that 
some  of  the  /i  parameters  Eire  zero. 
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Independence  Model 


In  the  case  of  the  independence  model,  Fijk  =  *"  "*  implies  that 
\nFijk  =  A*  +  A*i(t)  +  A^2(j)  +  A*3(*)  with  all  remaining  /i  parameters  zero. 

Pariial  Independence  Model 

For  the  partied  independence  model  the  two-way  interaction  between  i  and 
j  results  in  H\2{ij)  being  non-zero.  The  other  possible  interactions  are 
zero.  The  loglinear  model  for  this  particular  partizd  independence  model  is 
therefore  given  by 

In  Fijk  =    +  Mi(t)  +  /i20)  +  /^3(i)  +  A^i2(.7)- 

If  the  table  is  collapsed  over  k,  the  resulting  two-dimensional  table  is  fitted 
exactly. 

Conditional  Independence  Model 

In  the  conditional  independence  model,  the  relationship  between  i  and  k  is 
captured  by  /ii3(ijb),  and  the  relationship  between  j  and  k  is  captured  by 
A*23(jjfe)-  Since  i  and  j  are  independent  at  every  level  of  k,  A*i2(ij)  =  0.  The 
loglinear  model  in  this  case  is 

In  Fijk  =     +  +  /i20)  +  A^3(*)  +  Hl3(ik)  +  /i230i)- 

If  the  table  is  collapsed  over  i  or  over  j,  the  resulting  two-dimensional  tables 
eire  fitted  exactly. 

No  Three-way  Interaction  Model 

In  the  no  three-way  interaction  model  all  pairs  are  related,  but  these  relar 
tionships  are  independent  of  the  third  variable.  Only  the  term  /ii23(»it)  is 
zero.  The  loglinear  model  is  given  by 

In  Fijk  =fi  +  /i2(j)  +  ;i3(fc)  +  /il2(»;)  +  /^13(i]k)  +  /i230ik). 

In  this  case  the  three  two-dimensional  tables  obtained  by  collapsing  the  fit- 
ted table  on  the  third  variable  have  cell  frequencies  identical  to  the  observed 
two-dimensional  tables. 
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Saturated  Model 


The  saturated  model  given  at  the  beginning  of  this  section  fits  the  three- 
dimensional  table  perfectly.  While  this  model  is  not  needed  to  determine 
expected  frequencies,  it  is  often  useful  for  characterizing  the  interactions 
in  a  three-way  table. 

Muhiplicative  Form  of  the  Loglinear  Model 

Taking  the  anti-logcirithm  of  both  sides  of  the  loglinear  model  yields  a 
multiphcative  model  for  the  cell  frequency  F,jjfc.  The  equation  becomes 

Fijk  =  /?oA(i)/?2(j)/?3(i)/?12(ii)/?13(»i)^23(ii)/?123(»;i). 

The  beta  parameters  axe  sometimes  useful  for  characterizing  the  variation 
in  the  table.  The  beta  parameters  are  defined  by 

/?o  =  e^    /?i(o  =  e'^^<^    /?2(;)  =  e^^0),    /?3(i)  =  e^'^") , 
/?i2(ii)  =  e'^"^^^) ,    /?i3(»ik)  =  e^-<"') ,    /?230*)  =  e^"^^'') , 
/?i23(,i*)  =  e^"^('^*>. 

Hierarchical  Models 

The  above  collection  of  models  does  not  include  all  possible  variants  using 
the  parcimeters  specified  by  the  saturated  model.  Such  models  as 

In  Fijk  =  M  +  /il(t)  +  /i20)  +  /il2(.j)  +  /^23(ijfc)  +  /il3(.ifc) 

8Lnd 

In  Fijk  =     +  +  /i20)  +  H3{k)  +  /il23(.jfc) 

have  not  been  considered.  In  order  to  maintain  the  practice  of  defining 
higher  order  terms  using  deviations  of  lower  order  terms,  the  hierarchy 
principle  is  followed.  This  principle  requires  that,  if  a  given  term  is  fitted, 
all  lower  order  terms  involving  those  variables  must  also  be  included.  The 
main  difficulty  with  non-hierarchical  models  is  the  interpretation  of  the 
fitted  pairameters.  An  additional  problem,  however,  is  that  the  iterative 
proportional  fitting  procedure  cannot  be  used  to  fit  the  model  without 
some  transformation  of  the  model  being  carried  out  first. 
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Notation  for  Loglinear  Models 

To  simplify  the  notation  for  the  remainder  of  this  chapter,  the  various 
models  in  the  hierarchical  system  will  be  denoted  by  symbols  such  as  [1], 
[23]  and  [134].  Only  the  symbols  for  the  highest  order  interaction  for  each 
variable  will  be  used.  All  lower  order  terms  containing  that  variable  are 
automatically  included  in  the  hierarchical  system.  The  model  [12],  [234], 
for  instance,  implies  that  the  terms  [23],  [24]  and  [34]  are  also  present,  while 
the  parameters  corresponding  to  [13]  and  [14]  are  not  present. 

Model  Selection 

Given  a  three-dimensional  table  of  observed  cell  frequencies  riijk,  a  ve^iety 
of  models  in  the  hierarchical  system  can  be  fitted  by  replacing  Fijk  by  Eijk 
in  the  above  formulae  for  the  loglinear  model  pzirameters.  The  expression 
for  Eijk  depends  on  the  model  being  fitted.  The  various  formulae  for  Eijk 
for  the  various  models  have  been  outhned  above.  The  goodness  of  fit  of  a 
particulzir  model  can  be  judged  using  the  goodness  of  fit  statistics 
and  L^.  A  probabiUty  level  of  0.15  to  0.25  is  usuaUy  required  to  confirm 
that  the  model  adequately  represents  the  interactions  in  the  table.  In 
practice  seek  to  fit  the  simplest  model  while  maintaining  a  reasonable 
fit. 

In  addition  to  the  overall  measure  of  goodness  of  fit,  the  likelihood  ra- 
tio statistic  has  the  advantage  that  it  can  be  used  to  compaure  nested 
models  in  the  hierarchical  system.  Let  and  denote  two  likelihood  chi- 
square  statistics  for  two  alternative  models  and  assume  that  model  2  is  the 
Icirger  model  which  contains  all  the  parameters  of  model  1.  The  conditional 
likelihood  chi-square  statistic  L2.1  =  (Lf  —  L^)  can  be  used  to  determine 
whether  model  2  is  superior  to  model  1.  Under  the  null  hypothesis  that 
model  1  is  equally  as  good  as  model  2,  the  statistic  L2.1  is  asymptoticzdly  a 
distribution  with  degrees  of  freedom  equal  to  the  difference  (d.f.  model  1 
-  d.f.  model  2).  An  example  of  such  a  test  might  involve  a  comparison  of 
the  model  [13]  [2]  to  the  model  [12]  [13]  [23].  The  null  hypothesis  would  be 
that  the  terms  [12]  and  [23]  are  superfluous. 

Summary  of  Loglinear  Model  Fitting  Procedure 

The  system  of  fitting  loglinear  models  for  the  purpose  of  explaining  inter- 
action in  a  multidimensional  contingency  table  can  be  demonstrated  by  the 
diagram  in  Figure  1. 
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Figure  1.   System  for  Fitting  and  Using  Loglinear  Models 


Product  Multinomial  Sampling 


In  product  multinomieJ  sampling,  certain  meirginals  are  held  fixed.  In 
the  three  dimensional  table  we  consider  the  two  cases  corresponding  to 
the  fixing  of  the  marginals  for  one  or  two  of  the  three  variables.  If  the 
margincds  are  fixed  for  the  first  variable,  then  the  loglinecir  model  must 
contain  the  term  This  will  ensure  that  the  fitted  marginals  are 

equal  to  the  observed  marginals  n,„ .  Similarly,  if  the  marginals  for  both 
variables  1  and  2  are  fixed,  then  the  model  must  contain  the  parameters 
A*2(j)  and  /ii2(»i)-  In  this  case  the  fitted  marginals         E,j,  and  Eij. 
are  equivalent  to  the  sample  marginals  n,..,  n,j,  and  n,j.. 

In  product  multinomial  sampling  some  of  the  variables  can  be  viewed  as 
response  variables,  while  the  remainder  can  be  viewed  as  fixed  or  controlled. 
The  control  variables  have  the  fixed  marginals,  while  the  marginals  for  the 
response  variables  are  viewed  as  an  outcome  of  the  sampling  process.  The 
weighted  least  squares  approach  assumes  product  multinomial  sampling. 
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Example 


For  the  example  presented  in  Table  2,  the  entire  set  of  loghnear  models  were 
fitted  using  the  maximum  Ukelihood  estimators  Eij^.  Table  3  sununarizes 
the  goodness  of  fit  statistics  for  the  various  models.  The  first  row  is  the 
independence  model  which  permits  all  three  marginals  to  vary  but  contains 
no  interaction. 

Rows  2,  3  and  4  show  the  results  for  the  fitting  of  the  three  possible 
partial  independence  models.  In  row  2  the  model  [2],  [13]  allows  variables  1 
£ind  3  to  be  related,  but  both  are  assumed  to  be  independent  of  variable  2. 
Similarly,  in  row  3  variables  2  and  3  sure  independent  of  1,  and  in  row  4 
variables  1  and  2  are  independent  of  variable  3. 

Table  3.  Smnnuiry  of      Goodness  of  Fit  Statistics  for  System  of  Hierarchical  Models 


[2]  =  seat  belt  usage, 

[l]  =  driver  condition,  [3]  =  injury  level 

Model 

dX 

Likelihood 

Prob 

Pearson 

Prob 

1. 

[1],  [2],  [3] 

10 

940.02 

0.0000 

1057.47 

0.0000 

2. 

[2],  [13] 

7 

444.85 

0.0000 

372.21 

0.0000 

3. 

[1].  [23] 

7 

877.16 

0.0000 

967.92 

0.0000 

4. 

[3],  [12] 

9 

542.50 

0.0000 

682.37 

0.0000 

5. 

[12],  [23] 

6 

479.69 

0.0000 

610.75 

0.0000 

6. 

[13],  [23] 

4 

382.02 

0.0000 

317.32 

0.0000 

7. 

[13],  [12] 

6 

47.34 

0.0000 

44.51 

0.0000 

8. 

[12],  [13],  [23] 

3 

5.02 

0.1705 

5.02 

0.1705 

The  three  conditioned  independence  models  axe  shown  in  rows  5,  6  and  7. 
In  row  5  the  model  [12],  [23]  requires  that  1  and  3  be  independent  at  each 
level  of  variable  2.  Similarly,  in  row  6  variables  1  and  2  are  independent 
at  each  level  of  3,  and  in  row  7  variables  2  and  3  are  independent  at  each 
level  of  1.  The  no  three-way  interaction  model  is  fitted  in  the  last  row.  In 
this  model  all  two-way  interactions  among  the  three  variables  are  assumed 
to  explsiin  all  the  interactions  in  the  table. 

An  examination  of  the  goodness  of  fit  statistics  reveals  that  the  no 
three-way  interaction  model  can  be  used  to  explain  the  interactions  among 
the  three  variables.  The  fitted  parameters  for  this  model  sire  summarized 
in  Table  4.  The  ratios  of  the  logUneax  model  parameter  estimates  to  their 
standard  error  are  also  shown  in  this  table  for  selected  psirameters.  Plots  of 
the  values  of  the  parameter  estimates  are  shown  in  Figure  2.  The  loghnear 
model  is  given  by 

log  Fijk  =  /i  +  +  /i2(i)  +  /^3(Jb)  +  /^12(ii)  +  /il3(t*)  +  A*230fc)- 
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jure  2.   Pzirameter  Estimates  for  LoglineM  Model  for  Accident  Data 
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From  the  fitted  parameters  in  Table  4  the  logarithm  of  the  geometric 
mean  of  the  expected  frequencies  is  6.002.  The  driver  condition  effects 
indicate  that  the  normal  condition  is  much  more  frequent  than  the  "been 
drinking^  condition.  The  seatbelt  usage  effects  indicate  that  many  more 
drivers  were  not  wearing  seatbelts  than  were  wearing  them.  The  injury  level 
parameters  indicate  that  the  large  majority  of  drivers  were  not  injured  and 
that  very  few  drivers  sustained  major  or  fatal  injuries. 

Table  4.  Fitted  Parameta^  for  No  Three- Way  Interaction  Loglinear  Model 

Overall  Mean  m  =  6.002 

Driver  Condition  Effects  =  1-212        mi(2)  =  -1-212  (-53.906) 

Seatbelt  Usage  Effects  ^2(1)  =  -1.119  A2(2)  =  1-119  (43.938) 

Injury  Level  Effects 

M3(i)  =   2.626  A3<2)  =  0.071         m3(3)  =  -0.376  m3(4)  =  -2322 

^  '       (97.270)  ^  '       (2.152)  '       (-10.315)  (-32.387) 

Driver  Condition  -  Seatbelt  Usage  Interaction 

Mi2(ii)  =  0.234       Al2(l2)  =  -0.234      mi2(21)  =  -0.234  Al2(22)  =  0.234 

^     '       (17.147)  ^     '  ^     '  ^  ' 

Driver  Condition  -  Injury  Level  Interaction 

Mi3(ii)  =  0.392       Ai3(i2)  =  0.006      Mi3(i3)  =  -0.061  /ii3(i4)  =  -0337 

(19.698)  ^  (0.219)  '       (-2.249)  (-5.578) 

Al3(21)  =  -0.392       M13(22)  =  -0.006         M13(23)  =  0.061  Al3(24)  =  0.337 


Seatbelt  Usage  -  Injxiry  Level  Interaction 

A23(ll)  =  0.085        ^23(12)  =  0.013  I. 
(3.714)  (0.490) 

M23(21)  =  -0.085       A23(22)  =  -0.013         ^23(23)  =  0.069  M23(24)  =  0.029 


A23(ll)  =  0.085         A23(12)  =  0.013        ^23(13)  =  -0.069  M23(14)  =  -0.029 

(3.714)  (0.490)  (-2.286)  (-0.465) 


The  interaction  effects  in  Table  4  suggest  that  normal  condition  drivers 
were  more  hkely  to  be  weaxing  seatbelts  than  drivers  who  had  been  drink- 
ing. The  driver  condition- injury  level  interactions  indicate  that,  in  com- 
pairison  to  drivers  who  had  been  drinking,  a  larger  proportion  of  drivers  in 
the  normal  category  had  no  injury  and  a  smaller  proportion  of  the  normal 
category  drivers  were  in  the  major  or  fatal  injury  category.  For  the  min- 
imal and  minor  injury  categories,  the  interaction  terms  were  quite  weak. 
The  interaction  between  driver  injury  level  and  driver  condition  therefore 
seems  to  affect  only  the  two  extremes  of  the  injury  level  range.  The  seatbelt 
usage-injury  level  interaction  appears  to  be  relatively  weak.  There  is  some 
tendency,  however,  for  seatbelt  users  to  be  over-represented  in  the  no- injury 
category  and  under-represented  in  the  minor  injury  category.  The  minimal 
injury  category  cind  the  major/fatal  category  show  only  shght  interactions 
with  seatbelt  usage. 
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In  conclusion,  we  could  say  that  a  large  majority  of  drivers  appeared 
normal,  had  not  been  wearing  seatbelts,  and  were  not  injured.  For  drivers 
wearing  seatbelts,  there  were  proportionately  fewer  who  sustained  an  in- 
jury and  proportionately  more  were  in  normad  condition  than  for  non- 
seatbelt  users.  Among  those  who  had  been  drinking,  proportionately  more 
sustained  a  minor  or  major/fatal  injury  than  among  those  who  appeared 
normal. 

A  comparison  of  the  observed  frequencies  to  the  expected  frequencies 
under  the  no  three-way  interaction  fitted  model  is  shown  in  Table  5.  The 
expected  frequencies  are  shown  in  round  brackets  under  the  correspond- 
ing observed  frequencies.  The  fit  seems  to  be  excellent  with  only  minor 
differences  in  the  minimal  and  minor  categories  for  drivers  who  had  been 
drinking  and  were  wearing  seatbelts.  The  values  of  the  standardized  resid- 
uals are  shown  in  square  brackets  for  each  cell.  The  largest  standardized 
residuals  occurred  in  the  minimal  and  minor  categories  for  drivers  who  had 
been  drinking.  These  residuals,  however,  were  quite  small  indicating  £in  ex- 
cellent fit.  In  these  two  cells  the  frequencies  are  relatively  small  cind  hence 
the  prediction  errors  are  proportionately  larger. 

Table  5.   Comparison  of  Observed  and  Expected  Frequencies 

Driver  Injury  Level 

Driver  Seatbelt 


Condition 

Usage 

None 

Minimal 

Minor 

Major/Fatal 

Yes 

12500 

604 

344 

38 

(12497.0) 

(613.3) 

(337.8) 

(37.9) 

[0.0] 

[-0.4] 

[0.3] 

[0.0] 

Normal 

No 

61971 

3519 

2272 

237 

(61974.0) 

(3509.7) 

(2278.2) 

(237.1) 

[0.0] 

[0.2] 

[-0.1] 

[0.0] 

Yes 

313 

43 

15 

4 

(316.0) 

(33.7) 

(21.2) 

(4.1) 

[-0.2] 

[1.6] 

[-1.3] 

[-0.1] 

Been  Drinking 

No 

3992 

481 

370 

66 

(3989.0) 

(490.3) 

(363.8) 

(65.9) 

[0.0] 

[-0.4] 

[0.3] 

[0.0] 

The  fitted  parameters  in  Table  4  were  converted  to  multiplicative  pa- 
rameters and  are  summarized  in  Table  6.  The  multiplicative  form  of  the 
fitted  model  is  given  by  the  equation 

Fijk  =  ^0/?l(»)^2(i)/?3(*)/?12(»j)A3(»jk)/?23(i*). 
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Table  6.  Mviltiplicative  Parameters  for  No  Three- Way  Interaction  Model 


Geometric  Mean                              /?  =  404.362 

Driver  Condition  Effects                         =  3.361 

^(2)  = 

=  0.298 

Seatbdt  Usage  Effects                     /^(i)  =  0-237 

42(2)  = 

=  3.061 

Injviry  Level  Effects 

/33(i)  =  13.824       Pz{2)  =  1-074       0^3)  =  0.687 

^(4)  = 

=  0.098 

Driver  Condition  -  Seatbelt  Usage  Interaction 
/0i2(u)  =  1.263     /!i2(i2)  =  0.792     012(21)  =  0.T92 

^12(22) 

=  1.263 

Driver  Condition  -  Injury  Levd  Intei^tion 
)3i3(ii)  =  1.481     ^3(12)  =  1.006     /3i3(i3)  =  0.941 

^3(14) 

=  0.714 

4i3(2l)  =  0.675     /3i3(22)  =  0.994     )3i3(23)  =  1.063 

y3l3(24) 

=  1.401 

Seatbelt  Usage  -  Injviry  Level  Interaction 

/323(li)  =  1.088     /323(i2)  =  1.013     /323(13)  =  0.934 

/323(14) 

=  0.971 

/323(21)  =  0.919     /323(22)  =  0.987     Aj3(23)  =  1-071 

/323(24) 

=  1.030 

Three-way  Interaction 

When  a  saturated  model  is  required  in  order  to  obtain  a  good  fit  for  a 
three-way  table,  the  three-way  interaction  /ii23(tjjb)  is  said  to  be  signifi- 
cant. The  presence  of  such  an  interaction  indicates  that  each  of  the  three 
two-way  interactions  cannot  be  assumed  to  be  constant  over  the  vairious 
levels  of  the  third.  As  zin  example,  consider  the  two-way  interaction  fin^ij)- 
This  parameter  measures  the  interaction  between  variables  1  and  2  and  is 
estimated  using  the  marginal  table  obtained  after  summing  over  the  sub- 
script k.  The  two-way  interaction  /ii2(»i)  therefore  represents  an  average 
relationship  between  vciriables  1  and  2  sununed  over  the  categories  of  the 
third  variable.  The  fact  that  /i  123(1;  fc)  is  non  zero  indicates  that  the  inter- 
action between  variables  1  and  2  varies  over  the  levels  of  variable  3. 

Example 

To  provide  an  example  interpretation  for  three-way  interaction  parauneters, 
the  estimates  /ii23(»jjb)  for  the  data  in  Table  2  are  shown  in  Table  7.  The 
largest  parameter  estimate  of  0.086  for  the  category  normal,  seatbelt  yes, 
and  minor  injury  allows  us  to  conclude  the  following: 
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(1)  The  proportion  of  individuals  who  sustained  a  minor  injury  while  weeir- 
ing  seatbelts  was  greater  for  normal  condition  drivers  than  for  drivers 
who  had  been  drinking.  In  other  words,  the  interaction  between  seat- 
belt  usage  and  injury  level  is  not  independent  of  driver  condition. 

(2)  The  proportion  of  individuals  who  sustained  a  minor  injury  while  in 
normal  condition  was  greater  for  those  wearing  seatbelts  than  for  those 
not  wearing  seatbelts.  Thus  the  interaction  between  driver  condition 
and  injury  level  depends  on  seatbelt  usage. 

(3)  The  proportion  of  individuals  who  wore  seatbelts  while  in  normal  con- 
dition was  greater  for  those  who  sustained  a  minor  injury  than  for 
those  who  sustained  a  minimal  injury.  The  interaction  between  seat- 
belt  usage  and  driver  condition  varies  with  the  injury  level. 


T£jDle  7.   Threeway  Interaction  Terms  from  Saturated  Model 


Condition 

Usage 

None 

Minimal 

Minor 

Major 

Normal 

YES 

-0.{X)7 

-0.080 

+0.086 

0.000 

NO 

+0.007 

+0.080 

-0.086 

0.000 

Been  Drinking 

YES 

+0.007 

+0.080 

-0.086 

0.000 

NO 

-0.007 

-0.080 

+0.086 

0.000 

2.  Logistic  Regression 

In  the  multiple  linear  regression  model  the  dependent  variable  Y  is  always 
assumed  to  have  an  interval  scale.  The  explanatory  variables  in  x,  however, 
can  be  either  interval  scaled  or  categorical.  If  the  dependent  variable  is 
categorical,  a  logistic  regression  model  can  be  used.  The  discussion  here  is 
restricted  to  the  case  of  a  dichotomous  dependent  variable. 


Tht  Point  Binomial 


We  assume  that  individuals  or  objects  can  be  classified  into  one  of  two 
mutually  exclusive  categories  Aot  B,  and  that  the  probabilities  associated 
with  these  two  categories  are  p  and  (1  —  p)  respectively.  As  an  example, 
the  categories  A  and  B  might  represent  the  events  that  a  business  firm  will 
or  will  not  go  bankrupt  in  the  next  yecir. 
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We  define  the  dummy  icindom  variable  Y  to  indicate  the  two  categories 
by  letting  Y  =  1  for  category  A  and  V  =  0  for  category  B.  The  probability 
density  for  Y  is  therefore  given  by 

/(y  I  p)  =  P»'(i -?)('->-) 

which  is  the  density  of  a  point  binominal. 
Probability  as  a  Function  of  Other  Variables 

To  continue  the  example  of  business  firms  and  bankruptcy,  we  assume  that 
the  probability  of  bankruptcy  depends  on  a  measure  of  finsuicicd  health  D, 
where  D  is  a,  linear  function  given  by  D  =  Po  +  ^iX  and  where  X  is 
a  measure  of  a  company's  abihty  to  repay  its  debts,  such  as  debt-equity 
ratio.  In  other  words  the  probability  of  bankruptcy  is  a  function  of  D  and 
will  be  denoted  by  p{D).  For  an  individual  firm  i  with  debt-equity  ratio  x,-, 
di  =  /?o  +  PiXi  and  the  probabihty  density  for  yi  is  given  by 

f{yi\pW)  =  \p(diT[i-p{di)f'-''^. 

For  a  random  sample  of  n  firms  we  observe  {di,d2, . . .  ,dn),  and  the  joint 
density  for  (yi ,  y2,       yn)  is  given  by 

f(yuy2y . . . ,  yn  I  p{di),p{d2), . .  .,p{dn)) 

...[pK)Mi-pK)](^-s'-) 
»=i 

Note  here  that  the  parameters  /3o  and  f3i  are  assumed  to  be  constant  across 
the  complete  sample. 

The  Logit  Function 

To  be  able  to  relate  the  value  y  of  the  response  variable  Y  to  the  \a\ue  d  of 
the  variable  D,  a  more  specific  assumption  about  the  form  of  the  function 
p(d)  is  required.  The  logistic  regression  model  assumes  that  p{d)  is  given 
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Logistic  Regression  With  c  Explanatory  Variables 

The  logistic  regression  model  can  be  extended  to  include  c  explanatory 

c 

variables.  In  this  case  it  is  assumed  that  p  =  p{D)  where  £>  =  /?o4-  J2Pj-^3 

i=i 

is  a  Unear  function  of  c  explanatory  Vciriables.  Thus  for  the  bankruptcy 
example  we  assume  that  there  are  a  total  of  c  variables  that  are  related  to 
the  probability  of  bankruptcy. 
The  logit  of  p  is  given  by 

c 

]n\p/il-p)]  =  D  =  l3o  +  '^/3jXj 
i=i 

which  has  the  form  of  a  multiple  regression  model.  The  estimation  of  the 
parameters  /?o,/?i, •  •  •  ,/?c  is  usually  obtained  using  maximum  likehhood, 
which  must  be  determined  using  Newton-Raphson  procedures. 

Inference  for  the  Dichotomous  Logistic  Regression  Model 

The  dichotomous  logistic  regression  model  assumes  that  the  logit  func- 
tion ln[p/(l  —  p)]  can  be  modelled  as  a  linear  function  of  a  set  of  ex- 

c 

plaiiatory  variables  Po  -j-  Y^Xj^j.  Given  a  random  sample  of  observations 
J=i 

(y,-,  x,i ,  Xj2,  . . . ,  a^ic),  i  =  1, 2, . . . ,  n,  the  maximum  likelihood  estimator  of 
the  coefficient  can  be  obtained  as  outlined.  In  comparison  to  the  multiple 
linear  regression  model,  the  coefficients  in  this  case  must  be  interpreted 
differently.  A  marginal  one  unit  increase  in  Xj  brings  about  an  increase 
in  ln[p/(l  —  p)]  of  the  amount  pj.  The  magnitude  of  the  increase  in  p, 
however,  depends  on  the  initial  value  of  p. 

Comparing  Nested  Models 

Inferences  regarding  the  coefficients  in  the  logistic  regression  model  can  be 
made  by  comparing  models  and  sub  models  using  a  likelihood  ratio  test. 
To  compare  a  full  model  with  c  explanatory  variables  plus  an  intercept  to 
a  reduced  model  with  (c  —  q)  explanatory  variables  plus  intercept,  the 
logarithm  of  the  likelihood  ratio  yields  the  statistic  —  2[ln  Lq  —  In  L],  which 
has  a  distribution  with  q  degrees  of  freedom  if  the  q  deleted  variables 
are  superfluous.  L  is  the  likelihood  function  for  the  full  model,  while  Lq  is 
the  likelihood  function  for  the  reduced  model.  The  reader  may  recall  that 
this  approach  was  also  used  for  loglinear  models  in  the  three-dimensional 
contingency  table  discussed  above. 


36 


p 

1 

c 

d 

Figure  3.  Shape  of  Logistic  Distribution  Functions 

by  the  distribution  function  for  a  logistic  density  G{d).  Hence        =  G{cl), 
-GO  <  d  <  oo.  The  shape  of  G((f)  is  illustrated  in  Figure  3. 
The  equation  for  p{d)  is  given  by 

K<f)  =  e''/(l  +  e''), 

which  is  the  logistic  distribution  function.  The  shape  of  p(d)  for  the  logis- 
tic is  quite  similar  to  the  shape  for  a  normal  distribution  function.  As  we 
outline  next,  the  logistic  transformation  lends  itself  to  a  useful  explicit  func- 
tional relationship  between  p{d)  and  d.  The  standardized  logistic  density 
is  given  by 

/H  =  e-/(l  +  e-)2 

and  the  mean  and  variance  of  this  density  are  0  and  ir^/Z  respectively.  The 
standard  normal  and  standardized  logistic  distribution  yield  very  similar 
shaped  densities  and  distribution  functions.  Like  the  standard  normal  den- 
sity, the  standardized  logistic  density  has  a  median  and  mode  of  zero  and  a 
skewness  of  zero.  The  kurtosis  of  the  logistic  density  is  4.2  which  indicates 
fatter  tzuls  than  the  normal  which  has  a  kurtosis  of  3.  The  standardized 
logistic  distribution  with  w*  =  wl^fn^-jz  has  shghtly  heavier  tails  than  the 
standard  normal  distribution. 

An  important  advaintage  of  the  logistic  distribution  in  this  context  is  that 
the  logit  transformation  ln[p/(l  —  p)]  has  the  form 


Therefore,  if  d  is  assumed  to  be  a  linear  function  of  x,  d  =  or-f the  logit 
has  the  fzunihar  linear  model  form.  This  logit  model  is  usually  referred  as 
the  logistic  regression  model. 


Goodness  of  Fit 


A  pseudo  measure  of  goodness  of  fit  is  given  by 

=  1-lnI/lni; 

where  Lq  denotes  the  likeUhood  function  value  when  all  variables  are  ex- 
cluded except  the  constant  term  /Sq.  Thus  the  sample  value  of  Lq  is  the 
value  of  L  evaluated  using  the  sample  proportion  for  the  majcimum  likeli- 
hood estimator  of  p.  This  measures  the  proportion  of  uncertainty  in  the 
data  that  is  explained  by  the  model.  If  the  full  model  is  a  perfect  indicator, 
then  L  =  1,  InL  =  0,  2ind  R"^  =  1,  If  the  reduced  model  yields  the  same 
likelihood  as  the  full  model,  then  In  L  =  In  Lq  and  =  0.  In  this  case  the 
explanatory  variables  contribute  nothing  to  the  likelihood. 
An  alternative  measure  of  goodness  of  fit  is  given  by 

Rl  =  {L""-Lf'")/{l-Lf'-'). 


Hosmer-Lemeshow  Goodness  of  Fii  Test 

Since  a  large  majority  of  the  observations  (r/j ,  iCji ,  i,-2,  •  •  •  >  ^ie)  »  =  1 , 2, . . . ,  n 
are  unique  in  the  sense  that  in  general  no  two  observations  yield  identical 
values  on  all  variables,  the  fitted  model  cannot  be  evaluated  using  the 
goodness  of  fit  tests  introduced  for  the  contingency  table.  A  goodness  of 
fit  test,  known  as  Hosmer-Lemeshow,  divides  the  range  of  p  [0, 1]  into  s 
mutually  exclusive  categories,  and  then  a  comparison  of  the  observed  and 
predicted  frequencies  be  carried  out  using  a  statistic.  The  categories 
can  be  determined  by  ranking  the  np  values  and  then  dividing  them  into  s 
equed  groups  or  by  dividing  the  range  of  p  into  s  equal  intervals. 

We  denote  the  actual  frequency  in  group  j  by  oj,  the  predicted  fre- 
quency by  Tij ,  and  the  average  value  of  p  in  group  j  by  pj .  The  statistic 

jP^^^Ov  is  approximately  x^  with  [s  —  2)  degrees  of  freedom  if  the 
j=i  ^jPj\^  Pj) 

fitted  logistic  regression  model  is  correct. 
Example  —  Bivariaie  Relationships 

To  provide  examples  for  the  discussion  of  qualitative  response  regression 
models  the  data  sununarized  in  Table  8  will  be  used.  The  data  represents  a 
sample  of  100  observations  on  married  women  selected  from  the  Michigan 
Panel  Study  of  Income  Dynamics.  The  variables  THISYR  and  LASTYR 
are  indicator  variables  for  whether  the  wife  worked  (=1)  or  did  not  work 
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(=0)  in  the  current  year  and  the  previous  year  respectively.  The  variables 
CHILD  1,  CHILD2,  and  BLACK  are  dummy  variables  indicating  whether 
the  wife  has  children  under  2  (CHILDl),  children  between  age  2  and  age  6 
(CHILD2)  or  is  BLACK  respectively.  Finally  the  three  variables  AGE, 
EDUC  and  HUBINC  are  measures  of  the  years  of  age  and  years  of  education 
of  the  wife  ajid  the  income  of  the  husband,  respectively.  The  variables 
THISYR  and  LASTYR  will  be  used  as  response  variables  and  the  remaining 
variables  wiU  be  used  as  explanatory  variables. 

To  examine  the  bivaxiate  relationships  between  THISYR  £ind  LASTYR 
and  each  of  the  six  explanatory  variables,  single  variable  logistic  regression 
models  were  estimated.  The  results  are  summarized  in  Table  9. 

To  iUustrate  the  information  contained  in  Table  9,  we  shsdl  examine 
in  detail  the  results  for  the  interval  seeded  V2iriable  HUBINC  and  for  the 
categorical  variable  CHILDl.  For  the  variable  HUBINC  the  fitted  logistic 
regression  model  in  the  case  of  THISYR  has  the  equation 

ln[p/(l  -  p)]  =  1.6001  -  0.0675  HUBINC 

where  p  is  the  probabihty  that  the  wife  will  choose  to  work  THISYR.  The 
log  of  the  hkehhood  ratio  for  the  model  is  given  by  In  Li  =  —56.982  while 
the  log  of  the  likeUhood  ratio  with  HUBINC  omitted  is  InLo  =  -59.295. 
The  likehhood  ratio  statistic  is  therefore  — 2[ln  Lq  —  In  Li]  =  4.62  which 
has  a  p-value  of  0.0315  for  a  1  d.f.  x^-  The  fitted  equation  indicates  that 
as  HUBINC  decreases  the  value  of  p  increases.  From  Table  8  the  range  of 
HUBINC  is  0  to  54.3  and  hence  the  value  of  the  logit  varies  from  +1.6001 
to  —2.0652.  The  range  of  values  for  p  is  therefore  given  by 

p  =  exp[+1.6001]/(l  +  exp[+1.6001])  =  0.83 

and 

p  =  exp[-2.0652]/(l  +  exp[-2.0652])  =  0.11. 

For  the  categorical  variables  CHILDl,  CHILD2  and  BLACK,  dummy 
variable  coding  was  used  with  CHILDl  =  1,  CHILD2  =  1  and  BLACK  =  1 
indicating  the  presence  of  a  child  under  2,  a  child  of  age  2-6  and  an  indi- 
vidual of  the  black  race.  For  the  variable  CHILDl  the  fitted  model  is  given 
by 

ln[p/(l  -  p)]  =  1.1272  -  2.7366  CHILDl. 

The  probeibihty  that  the  woman  chooses  to  work  therefore  varies  from  p  = 
exp[1.1272]/(l4-exp[1.1272])  =  0.76  for  CHILDl  =  0  top  =  exp[-1.6094)/ 
(1  +  exp[-1.6094])  =  0.17  for  CHILDl  =  1.  The  significance  of  the  co- 
efficient of  CHILDl  is  obtained  from  -2[lnLo  -  InLi]  =  -2[-59.295  - 
(-55.006)]  =  8.58  which  has  a  p-value  of  0.0034  for  a  1  d.f.  x^- 
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Table  8.  Pull  Time  Work  Outside  the  Home  for  Married  Women 


OBS 

LAb 1 Y  K 

InloYn, 

CrllLDl 

L>n.lLD2 

rJLAUis. 

EiUUL/ 

1 

0 

0 

0 

0 

0 

4^40 

12 

42 

2 

0 

0 

0 

1 

0 

13.648 

12 

31 

3 

1 

1 

0 

1 

1 

4.973 

10 

38 

4 

0 

1 

0 

0 

0 

8.427 

12 

46 

5 

0 

1 

0 

0 

0 

18.320 

18 

46 

6 

0 

1 

0 

1 

1 

7.680 

10 

29 

7 

1 

1 

0 

1 

0 

5.612 

12 

25 

8 

0 

0 

0 

1 

0 

13.554 

12 

32 

9 

0 

0 

0 

0 

5.329 

12 

26 

10 

1 

0 

0 

0 

10.511 

12 

29 

11 

1 

0 

0 

0 

10.486 

12 

34 

12 

1 

0 

0 

0 

14.071 

16 

38 

13 

1 

0 

0 

0 

9.024 

12 

32 

14 

1 

0 

1 

0 

14.329 

12 

36 

15 

1 

1 

0 

0 

5.118 

18 

28 

16 

1 

0 

0 

1 

3.044 

12 

37 

17 

1 

0 

0 

1 

2.640 

7 

38 

18 

1 

0 

0 

1 

2.050 

7 

43 

19 

0 

0 

1 

1 

6.750 

12 

23 

20 

0 

0 

0 

0 

3J383 

12 

24 

21 

1 

0 

0 

0 

6.630 

12 

40 

22 

1 

0 

0 

0 

7.000 

12 

46 

23 

1 

0 

0 

0 

8.815 

12 

42 

24 

1 

0 

0 

0 

3.450 

12 

46 

25 

0 

0 

0 

0 

12.031 

12 

42 

26 

1 

0 

0 

1 

6.144 

12 

31 

27 

0 

0 

0 

1 

0 

11.513 

12 

39 

28 

0 

1 

0 

1 

0 

12.167 

12 

46 

29 

0 

0 

1 

0 

0 

9.968 

16 

28 

30 

0 

0 

1 

0 

0 

5.888 

12 

23 

31 

1 

1 

0 

0 

0 

10.232 

12 

32 

32 

1 

1 

0 

0 

0 

8.017 

12 

40 

33 

1 

1 

0 

0 

0 

11.686 

12 

45 

34 

1 

0 

0 

1 

0 

28.363 

12 

31 

35 

1 

1 

0 

0 

1 

4343 

7 

46 

36 

1 

1 

0 

0 

0 

10.554 

12 

38 

37 

1 

1 

0 

1 

0 

2.484 

10 

29 

38 

0 

0 

0 

0 

0 

5.672 

12 

44 

39 

1 

1 

0 

0 

1 

13.319 

18 

31 

40 

1 

1 

0 

0 

1 

7.678 

18 

35 

41 

1 

1 

0 

0 

0 

7.162 

12 

24 

42 

0 

0 

0 

0 

0 

7.804 

12 

34 

43 

0 

1 

0 

1 

0 

13.648 

16 

28 

44 

0 

0 

0 

1 

0 

9.311 

12 

27 

45 

1 

1 

0 

0 

0 

27.938 

12 

46 

46 

1 

1 

0 

0 

6.704 

12 

27 

47 

1 

1 

0 

0 

0 

7.711 

12 

32 

48 

1 

1 

0 

0 

0 

8.576 

16 

38 

49 

0 

1 

0 

1 

0 

7.223 

16 

26 

50 

0 

0 

1 

0 

0 

11.259 

16 

31 

40 


Table  8.   Pull  Time  Work  Outside  the  Home  for  Married  Women  (continued) 
OBS    LASTYR  THISYR  CHILDl  CHILD2  BLACK  HUBINC    EDUC  AGE 


51 

0 

0 

0 

1 

0 

26.063 

12 

30 

52 

0 

0 

0 

11.776 

12 

42 

53 

I 

I 

0 

0 

0 

12.793 

18 

46 

54 

I 

I 

0 

0 

0 

11.080 

12 

44 

55 

I 

I 

0 

0 

0 

7.074 

12 

31 

56 

I 

I 

0 

1 

0 

6.679 

12 

36 

57 

I 

0 

0 

0 

15.868 

12 

45 

58 

I 

I 

0 

0 

0 

7.972 

16 

42 

59 

1 

0 

1 

0.000 

12 

29 

60 

I 

I 

0 

0 

0 

3.030 

10 

43 

61 

I 

I 

0 

0 

0 

2.970 

16 

27 

62 

I 

I 

0 

0 

0 

9J305 

12 

40 

63 

0 

0 

0 

8.125 

12 

30 

64 

0 

1 

1 

13.033 

10 

29 

65 

I 

I 

0 

0 

1 

0.000 

12 

39 

66 

I 

I 

0 

1 

1 

2.781 

12 

30 

67 

I 

I 

0 

0 

1 

3.010 

12 

35 

68 

0 

0 

0 

26.056 

12 

40 

69 

0 

0 

0 

5.795 

12 

46 

70 

I 

I 

0 

1 

0 

0.000 

12 

36 

71 

I 

I 

0 

1 

0 

2.639 

12 

28 

72 

I 

I 

0 

0 

0 

9.087 

12 

24 

73 

I 

0 

0 

0 

12.312 

12 

34 

74 

I 

0 

0 

0 

7.325 

12 

33 

75 

I 

0 

0 

0 

3.517 

10 

26 

76 

I 

0 

0 

0 

17.140 

12 

35 

77 

I 

I 

0 

0 

0 

24.054 

12 

40 

78 

1 

I 

0 

0 

1 

6.144 

12 

42 

79 

I 

0 

0 

1 

13.211 

12 

34 

80 

1 

0 

0 

0 

9.309 

12 

45 

81 

1 

1 

0 

0 

0 

3.135 

10 

40 

82 

1 

1 

0 

0 

0 

2.935 

10 

45 

83 

I 

0 

0 

0 

9.607 

12 

41 

84 

I 

0 

0 

0 

10.629 

12 

44 

85 

0 

0 

0 

8.207 

12 

24 

86 

0 

0 

0 

9.772 

12 

42 

87 

0 

0 

0 

8.955 

12 

46 

88 

0 

0 

0 

6.204 

10 

46 

89 

0 

0 

0 

1 

9.378 

12 

32 

90 

0 

0 

0 

0 

54.281 

12 

45 

91 

1 

0 

1 

0 

7.525 

12 

31 

92 

0 

1 

0 

0 

11.504 

12 

32 

93 

0 

0 

0 

0 

5.763 

12 

42 

94 

0 

0 

1 

0 

5.683 

12 

32 

95 

0 

0 

0 

0 

10.937 

12 

40 

96 

1 

0 

0 

0 

9.361 

12 

45 

97 

0 

0 

0 

1 

0 

6.342 

12 

35 

98 

1 

0 

0 

0 

0 

7.160 

10 

31 

99 

1 

0 

0 

1 

0 

7.788 

12 

31 

100 

1 

1 

0 

0 

1 

2.402 

10 

25 

41 
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An  exsunination  of  Table  9  reveals  that  the  most  important  variables 
in  predicting  whether  a  woman  will  choose  to  work  THISYR  are  AGE, 
HUBINC  and  CHILDl.  The  coefficients  of  these  variables  indicate  that  p 
tends  to  be  larger  if  AGE  is  large,  HUBINC  is  small,  CHILDl  =  0  and  if 
CHILD2  =  0.  For  the  variable  LASTYR  the  most  significant  explanatory 
variables  are  HUBINC,  CHILDl  and  CHILD2.  The  coefficients  in  these 
three  models  indicate  that  p  increases  with  decreasing  HUBINC,  and  that 
p  is  larger  if  CHILDl  =  0  and  if  CHILD2  =  0. 


Example  -  Logistic  Regression  With  Multiple  Explanatory  Variables 

To  determine  how  the  explanatory  variables  together  predict  p,  a  logistic 
regression  model  was  fitted  using  all  six  explanatory  variables.  The  fitted 
models  for  THISYR  and  LASTYR  are  given  by 

THISYR 

b[p/(l  -p)]  =  -  6.0624  -  0.1079  HUBINC  +  0.4777  EDUC 

(0.0472)      (0.0040)  (0.0148) 

+  0.0773  AGE  +  1.5451  BLACK 

(0.0765)  (0.0708) 

-  4.5179  CHILDl  -  1.1238  CHILD2, 

(0.0003)  (0.0621) 

LASTYR 

]n\p/(l  -p)]  =  +  6.3641  -  0.0799  HUBINC  -  0.0911  EDUC 

(0.0076)      (0.0257)  (0.4638) 

-  0.0870  AGE  -  0.0879  BLACK 

(0.0423)  (0.8937) 

-  3.6948  CHILDl  -  1.6928  CHILD2. 

(0.0008)  (0.0057) 

The  fitted  logistic  regression  model  for  THISYR  indicates  that  at  the 
maigin  the  probabiUty  that  a  woman  will  choose  to  work  increases  with 
decreases  in  HUBINC,  but  decreases  with  decreases  in  AGE  and  EDUC. 
For  the  dummy  variables,  a  woman  of  the  black  race  is  more  likely  to  work, 
while  if  children  are  present  the  woman  is  less  likely  to  work.  For  the 
variable  LASTYR  the  variables  HUBINC,  CHILDl  and  CHILD2  have  the 
SBLme  impact  as  in  the  case  of  THISYR  while  the  remaining  variables  are 
insignificant. 

The  log  likelihoods  for  the  two  models  are  —44.044  and  —53.314  for 
THISYR  and  LASTYR  respectively.  Excluding  all  six  variables  yields  log 
likelihoods  of  -59.295  and  —64.745.  The  likelihood  ratio  statistics  for 
the  significance  of  all  six  variables  are  given  by  —2[— 59.295  —  (—44.044)]  = 
30.502  and  -2[-64.745- (-53.314)]  =  22.862  which  have  j>-values  of  0.000 
and  0.001  when  compared  to  a      distribution  with  6  d.f.  The  pseudo 
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values  are  given  by  [1  -  44.044/59.295]  =  0.26  and  [1  -  53.314/64.745]  = 
0.18,  respectively.  The  Hosmer-Lemeshow  P  values  are  0.05  and  0.46 
for  the  variables  THISYR  and  LASTYR  respectively. 


The  FiUed  Model  and  Classification 


The  fitted  logistic  regression  model  can  be  used  to  obtain  the  value  of  p,-  for 

r 

each  observation  by  determining  the  value  of  ln[p,/(l  — p,)]  =  /?o+  Pj^ij 

i=i 

and  then  solving  for  p,-.  The  value  of  pi  is  given  by  pi  =  e^^ /{\  +  e*^). 
Assume  that  the  observation  is  placed  in  the  category  y  =  0  if  p,-  <  0.50, 
and  otherwise  the  observation  is  placed  in  the  category  Y  =  1.  A  prediction 
success  matrix  or  confusion  matrix  in  this  case  can  be  constructed  as  shown 
below. 


Pi  <  0.50 
p  >  0.50 


IVue  Category 

y  =  0      y  =  1 


"00 

"01 

'^10 

nil 

n  =  (noo  +  "01  +  "io"ii) 


This  table  shows  the  distribution  of  the  predictions  for  each  of  the  two 
categories.  The  proportion  of  correctly  classified  observations  is  given  by 
("00  +  "ii)/"-  The  logistic  regression  model  therefore  provides  a  discrimi- 
nant function  which  can  be  used  to  classify  unknowns. 


Example 


It  is  of  interest  to  examine  the  abilities  of  the  two  fitted  models  to  pre- 
dict the  values  of  THISYR  and  LASTYR.  If  no  explanatory  variables 
are  included  in  the  model,  the  probabilities  based  on  the  observations  are 
p[THISYR  =  1]  =  0.72  and  p[LASTYR  =  1]  =  0.65.  Prediction  success 
tables  based  on  these  probabilities  are  therefore  given  by 


Predicted 
THISYR 
0  1 

THISYR 


8 

20 

20 

52 

28  72 


Predicted 
LASTYR 
0  1 

LASTYR 


12 

23 

23 

42 

35  65 
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We  would  therefore  expect  to  predict  correctly  60%  of  the  values  of  THISYR 
and  54%  of  the  values  of  LASTYR.  An  equal  priors  model  would  only  be 
expected  to  predict  50%  correctly. 

To  determine  the  predictions  based  on  the  fitted  logistic  models,  values  of 
p  were  determined  for  both  models  for  all  100  observations.  If  p  <  0.50  for 
a  particular  individual,  then  that  individual  was  placed  in  the  *not  work' 
category;  otherwise  the  individual  was  placed  in  the  'work*  category.  The 
prediction  success  tables  are  shown  below. 


Predicted 
THISYR 


Predicted 
LASTYR 


THISYR 

0 

1 

LASTYR 

0 

1 

0 

13 

15 

28 

0 

14 

21 

1 

5 

67 

72 

1 

8 

57 

18 


82 


22 


78 


35 
65 


For  the  variable  THISYR  the  use  of  the  fitted  logistic  regression  model 
results  in  a  correct  classification  for  80%  of  the  cases,  while  for  the  variable 
LASTYR,  use  of  the  fitted  model  results  in  a  correct  classification  for  71%. 
The  increases  in  %  correctly  classified  as  a  result  of  the  fitted  logistic  re- 
gression model  are  20%  and  17%  respectively.  In  other  words,  in  the  case 
of  THISYR  an  additional  20  of  the  100  cases  were  correctly  classified,  while 
for  LASTYR  an  additionsd  17  of  the  100  cases  were  correctly  classified. 
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STATISTICAL  GRAPHICS 
Marion  Herbut 
Alberta  E^nviromnental  Centre 
Vegreville,  Alberta 

Introduction 

The  use  of  SAS/Graphl  to  produce  regression  analysis  plots  and  plots  with  standard 
deviation/standard  error  bars  is  demonstrated.  Examples  are  given  for  enhancing  graphical 
output,  manipulating  data,  and  transferring  gpraphical  output  to  other  graphics,  word  processing 
and  desktop  publishing  software  for  further  modification. 


SAS/Graph  regression  analysis  plots 

The  ability  of  SAS/Graph  to  analyze  data  statistically  and  manage  large  data  sets  simplifies 
regression  analysis.  The  GPLOT  procedure  is  used  to  plot  Y  against  X  variables.  The  regression 
analysis  is  specified  in  the  SYMBOL  statement  with  the  interpolation  option  (I=optiony  IsRL 
requests  a  linear  regression  and  IaRL0  would  set  the  intercept  to  zero. 

The  regression  equations  can  be  linear(RL),  quadratic  (RQ)  or  cubic  (RC).  Confidence  limits  can 
be  added  to  the  regression  line  by  further  specification  in  the  I=option.  IsRLCLI99  requests  a  linear 
regression  with  hnes  representing  99-percent  confidence  limits  on  individual  predicted  values. 
I=RLCLM90  requests  the  same  smalysis,  but  with  90-percent  confidence  Umits  on  the  mean 
predicted  values. 

The  following  example  produces  a  regression  analysis  plot  with  system  defaults. 


1  SAS  Institute,  Inc.  SAS/Graph  Guide  for  personal  computers.  Version  6  edition.  Cary,  N.C.: 
SAS  Institute,  Inc.,  1987.  500  pp. 
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GOPTIONS  DEVICE=VGA16; 

DATA  LEAVES; 

INPUT  X  Y; 

CARDS; 

.1267  11.33 

.1067  8.00 

.1067  8.67 

.1867  14.00 

.1733  12.00 

.1500  13.00 

.0867  6.67 

.2850  22.00 

.0400  3.33 


SYMBOL  I«RLCLM90; 
PROC  GPLOT; 
PLOT  Y*X; 
RUN; 

This  plot  can  be  enhanced  by  supplying  specific  values  for  various  options.  For  example,  title  and 
axes  labels  can  be  added  using  different  text  fonts  and  heights,  the  X  and  Y  axis  increments  can  be 
modified,  and  data  point  symbols  can  be  added. 

GOPTIONS  DEVICE=VGA16  GUNIT=CM  HTEXT=.8  FTEXT= DUPLEX; 
TITLE  RLCLMSO'; 

AXISl  LABEL=(A=90  Percentage  of  plant  strata  infested') 

MINOR=NONE 

ORDER=0TO25BY5; 
AXIS2  LABEL=('Number  /  plant  stratum  ) 

MINOR=NONE 

ORDER=0  TO  .3  BY  .05; 
SYMBOL  V=CIRCLE  I-RLCLM90  H=.5  L=l; 
PROC  GPLOT  DATA=LEAVES; 
PLOT  Y*X  /  VAXIS=AXIS1  HAXIS=AXIS2  FRAME; 
RUN; 
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RLCLM90 

y 

« 

o 

0 

15 

iio. 

0 

o>  5 

0 

c 

e  0. 

«»-  0.00 

— 1 — — — 1  1  1  1 — 

0.05      0.10      0.15      0.20  0.25 

0.30 

Number  /  plant  stratum 

Text  height  and  font  are  specified  in  the  GOPTIONS  statement  and  remain  in  effect  until 
specified  otherwise  or  until  the  session  is  terminated.  Output  can  be  further  controlled  by 
specifying  options  within  TITLE,  FOOTNOTE,  NOTE,  AXIS,  LEGEND,  PATTERN  and 
SYMBOL  statements. 

The  SAS/Graph  manual  documents  in  detail  the  many  options  available  to  enhance  graphics 
output  text  and  design. 

SAS/G^ph  standard  deviation/standard  error  bar  plots 

As  with  regression,  the  I=oDtion  in  the  SYMBOL  statement  is  used  with  the  GPLOT  procedure  to 
add  standard  deviation/standard  error  bars  to  the  data  plotted.  I»STD  is  used  when  multiple  Y 
values  occur  for  each  level  of  X,  and  it  is  desired  to  join  the  mean  Y  value  with  (±)  1,  2,  or  3 
standard  deviations  for  each  X.  I^STDM  computes  the  standard  error.  laSTDlJT  computes  1 
standard  deviation,  the  means  are  connected  from  bar  to  bar  (J),  and  tops  and  bottoms  are  added  to 
each  bar  (T).  Further  options  are  available  for  the  interpolation  values  (I=Qption). 
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In  the  next  example  I-STDMJT  requests  standard  error  bars  (default  value  of  2)  with  means 
connected  from  bar  to  bar  and  tops  and  bottoms  added  to  each  bar.  This  program  also  illustrates 
some  data  manipulation  to  overlay  two  lines  on  one  plot.  The  original  data  set  is  subset  into  two 
data  sets,  the  variable  to  be  plotted  renamed  imiquely  for  each  isolate,  and  finally  the  two  data  sets 
are  merged  again  into  one.  The  example  also  demonstrates  how  the  output  can  be  controlled 
separately  within  the  TITLE,  SYMBOL  and  AXIS  statements. 

Example: 

GOPTIONS  DEVICE=VGA16  GUNIT=CM; 

DATA  M;  SET  F2.SAIN; 

IF  CHEMICAL=  'M'; 

RENAME  PERCENT=PERCENTM; 

DATA  N;  SET  F2.SAIN; 

IF  CHEMICAL=  N'; 

RENAME  PERCENT=PERCENTN; 

DATA  PLOT;  SET  M  N; 

TITLE  F=DUPLEX  H=l  Effects  of  metalaxyl  on  growth  of  sainfoin'; 
SYMBOLl  I-STDMJT  V=NONE  W=2  L=l  C=B; 
SYMB0L2  I-STDMJT  V=NONE  W=2  L=l  C=B; 
PROG  GPLOT; 

AXISl  VALUE=(H=.9  F=DUPLEX) 

ORDER=OT01BY.l 

WIDTH=2  MINOR=NONE 

LABEL=(A=90  H=.9  F=DUPLEX  Percent); 
AXIS2  VALUE=(H=.9  F=DUPLEX) 

ORDERS  0 1 2  3  7  8 12 13 16  18 

WIDTH=2  0FFSET=(.5CM) 

LABEL=(H=.9  F=DUPLEX  Isolate  ); 
PLOT  (PERCENTM  PERCENTN)*  ISOLATE  /  OVERLAY 
VAXIS=AXIS1  HAXIS=AXIS2; 

RUN; 
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Effects  of  metalaxyl  on  growth  of  sainfoin 
1.0h  t 


isolate 


If  the  data  set  already  has  standard  deviation/standard  error  values,  some  data  manipulation  is 
necessary  to  create  a  plot  with  error  bars.  Multiple  values  of  Y  for  each  value  of  X  will  have  to  be 
created.  Data  set  STD  contains  the  variables  ISOLATE,  CHEMICAL,  MEAN  and  STD.  The 
following  SAS  DATA  step  can  be  used  to  reshape  the  data: 

DATA  NEWSTD; 
SET  STD; 

PERCENT=MEAN+STD;  OUTPUT; 
PERCENT=MEAN;  OUTPUT; 
PERCENT=MEAN-STD;  OUTPUT; 
DROP  MEAN  STD; 
RUN; 

In  the  DATA  step,  the  original  data  set  (STD)  is  read  in.  Three  values  for  PERCENT  are  created 
for  each  ISOLATE:  the  original  mean,  the  mean  plus  the  standard  deviation  and  the  mean  minus 
the  standard  deviation. 
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Again  the  I=option  in  the  SYMBOL  statement  is  used  with  the  GPLOT  procedure  to  add  error  bars  to 
the  plot.  I«HILO  is  used  in  this  case  since  the  mean  and  standard  deviation  are  already  known. 

Similar  options  are  available  as  are  for  I«STD. 
Transferring  graphics 

Frequently  it  is  necessary  to  incorporate  SAS/Graph  output  within  a  document,  or  the  output 
requires  further  modification.  If  so,  this  can  be  facilitated  by  correct  device  driver  selection  in 
SAS/Graph. 

The  graphics  within  this  document  were  incorporated  using  WordPerfect®.  The  output  from 
SAS/Graph  was  sent  to  a  file  in  Hewlett-Packard®  Graphics  Language  (HPGL)  format  and  then 
placed  in  WordPerfect®.  HPGL  is  WordPerfect's®  recommended  export  format  for  SAS/Graph. 
To  create  the  HPGL  file,  the  GOPTIONS  statement  was  used  to  select  the  HP7475  plotter  driver  in 
SAS/Graph  and  to  direct  the  output  to  a  file  instead  of  the  default  serial  port: 

GOPTIONS  DEVICE=HP7475  GACCESS='SASGASTD>hpexport.plf; 

Using  PageMaker®  (the  desktop  publishing  software  demonstrated  in  the  workshop),  the  most 
suitable  SAS/Graph  export  format  for  placing  graphics  was  found  to  be  Encapsulated  Postscript 
(EPS).  With  other  export  formats  some  detail  is  lost,  the  scale  is  altered,  and  there  is  a  64K  file  size 
limit.  With  EPS  there  is  no  loss  of  detail,  the  scale  is  retained  even  when  reduced  considerably, 
Aid  there  is  no  file  size  limit  so  complex  graphics  are  not  restricted.  However,  EPS  graphics 
appear  as  a  shaded  box  on  the  monitor  because  EPS  is  a  language  readable  only  by  a  postscript 
printer,  and  a  postscript  printer  is  required  to  print  the  docximent.  Placing,  sizing  and  moving 
graphics  in  PageMaker®  is  very  simple  and  versatile,  accomplished  with  mouse  click  and  drag 
routines,  and  annotations  can  be  added  easily  around  and  within  the  shaded  graphics  box. 

SAS/Graph  output  can  also  be  exported  to  other  graphics  software  such  as  Harvard  Graphics®  and 
Lotus®  Freelance  for  further  modification.  The  preferred  export  format  in  this  case  would  be 
Computer  Graphics  Metafile  (CGM).  SAS/Graph  has  a  number  of  CGM  drivers  specific  the 
software  to  which  one  is  exporting.  Depending  on  the  graphics  software,  once  imported,  the 
SAS/Graph  output  can  now  be  edited  and  annotated.  Each  graphics  software  may  have  different 
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size  limits  on  the  CGM  file  that  it  can  import.  In  Harvard  Graphics®,  the  metafile  can  be  no  larger 
than  32K,  but  with  Zenographics  Mirage™  you  are  limited  only  by  the  memory  capacity  of  your 
system. 

With  the  various  options  available  for  SAS/Graph  export,  the  GREPLAY  procedure  is  useful  for 
storing  graphics  in  a  device  independent  catalog  and  replaying  them  later  with  the  appropriate 
device  driver  required  by  the  destination  software. 

Summary 

The  integration  of  statistical  analysis  and  graphics  in  SAS/Graph  along  with  its  data 
management  capacity  makes  it  a  powerful  tool  for  statistical  graphics. 

Regression  analysis  plots  and  plots  with  standard  deviation/standard  error  bars  are  requested  in 
the  SYMBOL  statement  with  the  interpolation  option  (I=optiQn)  in  GPLOT. 

SAS/Graph  has  many  options  available  for  enhancing  graphical  output  design  and  text.  If  further 
modification  is  required,  careful  selection  of  the  many  device  drivers  available  in  SAS/Graph 
can  ensure  that  graphics  are  transferred  correctly  to  the  destination  software. 
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A  DISCUSSION  OF  DISCRIMINANT  ANALYSIS  USING  DYSTOCIA  IN  BEEF  HEIFERS 
J.  A.  Basarab,  Beef  Management  Specialist 
Beef  Cattle  and  Sheep  Branch,  Animal  Industry  Division 
Alberta  Agriculture,  Edmonton,  Alberta 

Abstract 

Classification  of  subjects  into  two  or  more  groups  on  the  basis  of  one  or  more  numeric 
measurements  has  long  posed  a  problem  to  researchers.  The  most  commonly  used  approach  has 
been  to  apply  consecutive  numbers  to  the  groups,  treat  them  as  the  dependent  variable  and  then 
subject  the  dependent  variable  and  independent  variables  to  multiple  regression  analysis.  This 
approach  assumes  that  the  group  variable  is  continuous,  or  at  least  that  there  is  some  numeric 
interval  between  the  groups.  How  this  numbering  is  applied  can  dramatically  affect  conclusions. 
In  addition,  the  results  from  the  multiple  regression  analysis  lack  clarity.  The  analysis  does  not 
classify  subjects  into  groups,  but  merely  identifies  variables  which  significantly  affect  the 
dependent  variable.  Discriminant  analysis,  on  the  other  hand,  does  not  assume  numeric 
continuity  among  groups  and  classifies  subjects  into  discrete  groups  on  the  basis  of  a  battery  of 
measurements.  Discriminant  analysis  using  the  SAS  procedures  will  be  discussed  with  calving 
difficulty  in  beef  heifers  as  the  example. 
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APPLICATION  OF  TEDE  REPEATED  MEASURES  ANOVA  IN  AGRICULTURE 

L^  Goonewardene 
Research  Analyst,  Beef  Cattle  and  Sheep  Branch 
Alberta  Agriculture 
Edmonton,  Alberta 

Introduction 

In  the  agricultural  sciences  it  is  often  of  interesl^necessary  to  collect  multiple  observations  on  the 
same  samphng  unit.  If  the  example  is  a  plot  of  land  and  samples  are  taken  from  equal  sized 
strips  or  sub  plots  which  are  randomized,  the  design  is  called  a  split  plot.  The  common  notation 
is  to  call  the  larger  plot  the  'main  plot'  and  the  randomized  units  on  which  measurements  are 
made  the  'sub  or  split'  plots.  The  split  plot  is  a  very  efficient  design  in  soil/plant  experiments  due 
to  its  ability  to  use  space.  Furthermore,  compared  with  a  factorial  design,  where  plot  sizes  for  the 
different  combination  of  main  effects  are  equal,  the  split  plot  can  sub-divide  the  main  plot  into 
many  levels  of  splits  and  still  maintain  all  of  the  interactions  among  main  effects. 

Often  times,  multiple  measurements  are  taken  on  each  main  plot  at  different  times  and  such  a 
design  is  called  a  split  plot  in  time.  The  design  can  remain  a  'true'  split  plot  provided  the  time  at 
which  samples  are  taken  is  randomly  assigned  to  plots,  but  when  the  sampling  unit  is  an  animal 
or  individual,  time  often  becomes  fixed.  Thus,  where  levels  of  the  sub  or  split  plot  are  fixed  (i.e. 
not  assigned  randomly)  it  is  called  a  repeated  measures  design  (Pendergast  and  Littel  1988). 

The  objective  of  the  Workshop/Paper  was  to  work  through  the  analysis  of  a  split  plot  and  repeated 
measures  design  using  the  SAS^  system  and  compare  the  univariate  and  multivariate  analyses. 

HieSpUtPlot 

The  spHt  plot  analysis  is  a  univariate  type  where  there  is  only  a  single  dependent  variable  (Y) 
and  many  independents  (X^).  The  example  used  is  hypothetical,  where  12  animals  have  been 
subject  to  two  treatments  F  &  C  and  an  enzyme  labelled  as  G6P  sampled  at  four  periods  (time) 
designated  as  1,2,3  &  4. 
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SAS  CODE 
DATA  SPLIT; 

INPUT  ANIMAL  TREAT  $  PERIOD  G6P; 
CARDS; 


1 

F 

1 

14 

1 

F 

2 

12 

1 

F 

3 

16 

1 

F 

4 

11 

2 

F 

1 

12 

2 

F 

2 

16 

12 

C 

1 

21 

12 

C 

2 

22 

The  SAS  code  shown  above  is  the  usual  format  for  entering  data  designed  for  a  split  plot  type  of 
analysis  (in  columns).  However,  if  the  data  is  entered  for  a  multivariate  (repeated)  measures 
analysis  of  variance  in  rows,  then  some  modification  is  needed  to  get  it  into  a  split  plot  format. 
For  example,  if  the  data  were  read  in  as  G6P1  to  denote  G6P  at  Period  1,  G6P2  to  denote  G6P  at 
period  2  ,  and  so  on,  a  DROP  statement  and  four  OUTPUT  statements  would  be  needed  as  shown 
below  to  get  the  data  into  shape  (ie.  a  row  to  column  conversion)  for  analysis. 

SAS  Code 
DATA  SPLITl; 

INPUT  ANIMAL  TREAT  $  G6P1  G6P2  G6P3  G6P4; 

DROP  G6P1  -  G6P4; 

PERIOD  =  1;  G6P  =  G6P1;  OUTPUT; 

PERIOD  =  2;  G6P  -  G6P2;  OUTPUT; 

PERIOD  =  3;  G6P  =  G6P3;  OUTPUT; 

PERIOD  =  4;  G6P  =  G6P4;  OUTPUT; 

CARDS; 
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Once  the  data  are  in  columns,  either  PROG  ANOVA  or  PROG  GLM  can  be  invoked.  If  the  data 
are  balanced,  either  procedure  may  be  used,  but  if  the  data  has  missing  values,  it  is 
recommended  that  PROG  GLM;  be  used  and  least  squares  estimates  obtained. 

SAS  Code 
PROG  GLM; 

GLASSES  TREAT  ANIMAL  PERIOD; 

MODEL  G6P  =  TREAT  ANIMAL(TREAT)  PERIOD 

TREAT  *  PERI0D/SS3; 
TEST  H  =  TREAT  E  =  ANIMAL(TREAT)/ETYPE  =  3  HTYPE  =  3; 
MEANS  TREAT/E  =  ANIMAL(TREAT)  ETYPE  =  3  SNK; 
LSMEANS  TREAT/E  =  ANIMAL(TREAT)  ETYPE  =  3  STDERR; 
MEANS  PERIOD  TREAT  *  PERIOD/ETYPE  =  3  SNK; 
LSMEANS  PERIOD  TREAT  *  PERIOD/ETYPE  =  3  STDERR; 
RUN; 

The  above  program  should  generate  SAS  output  which  shows  the  'split  plot  type'  analysis  of 
variance  in  two  parts,  commonly  called  the  analysis  above  and  below  the  split.  The  variation 
associated  with  TREAT  is  usually  considered  as  being  above  the  split  or  main  plot ,  whereas 
PERIOD  and  interaction  of  PERIOD  and  TREAT  are  considered  below  the  split  or  sub  plot.  The 
SAS  System  by  default  tests  all  main  effects  such  as  TREAT  with  the  residual  mean  square 
(error  mean  square)  as  the  Model  is  fixed  and  only  the  residual  or  error  term  is  random. 
However,  in  a  split  plot,  the  correct  error  term  (based  on  expected  mean  squares)  for  testing 
TREA.T  is  ANIMAL(TREAT)  and  this  should  be  specified  in  the  program  using  a  TEST 
Statement.  It  is  also  necessary  to  specify  the  error  term  for  the  calculation  of  the  standard  error 
in  the  LSMEANS  Statement  A  type  III  sums  of  squares  is  used  in  all  cases. 
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The  SAS  output  (modified)  will  look  somewhat  like  this: 


Table  1  SAS  Printout  From  a  Typical  Split  Plot  Design 


DEPENDENT  VARIABLE  G6P 


SOURCE 

DF 

SS  MS 

F  VALUE 

PR>F 

R-SQ  C.V. 

MODEL 

17 

789.5  46.6 

13.05 

0.0001 

0.88  11.3 

ERROR 

30 

106.7  3.5 

ROOT  MSE 

G6P  MEAN 

CORR  TOTAL 

47 

896.3 

1.8 

16.6 

SOURCE 

DF 

TYPE  III  SS 

F  VALUE 

PR>F 

Comment* 

TREAT 

1 

500.5 

140.6 

0.0001 

Above  the  split* 

ANIMAL(TREAT) 

10 

268.0 

7.5 

0.001 

(incorrect) 

PERIOD 

3 

8.2 

0.77 

0.51 

Below  the  split* 

TREAT*PERIOD 

3 

12.7 

1.19 

0.32 

(correct) 

TEST  OF  HYPOTHESIS  USING  TYPE  III  MS  FOR  ANIMAL(TREAT) 

TREAT 

1 

500.5 

18.6 

0.001 

Above  the  split* 

(correct) 

*  Will  not  appear  in  SAS  output. 


As  you  may  notice,  the  terms  below  the  split  i.e.  PERIOD  TREAT*PERIOD  are  correctly  tested 
and  the  F  and  P  values  are  accurate  as  the  appropriate  error  term  has  been  used  for  testing. 
However,  above  the  split  the  analyst  should  only  use  the  F  value  of  18.6  and  P>F  of  0.001  as  the 
TEST  statement  has  now  directed  the  program  to  use  the  appropriate  error  term 
ANIMAL(TREAT)  for  testing.  Thus,  it  could  be  seen  that  TREAT  effects  are  significant 
P=0.001,  whereas  PERIOD  and  TREAT*PERIOD  is  not  significant  P>0.05.  In  addition,  the 
program  will  generate  means  with  a  Student  Newman  Keuls'  comparison,  least  squares  meains 
and  standard  errors,  for  TREAT,  PERIOD  and  TREAT*PERIOD,  as  they  have  been  specified  by 
the  MEANS  and  LSMEANS  statements. 
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The  Repeated  Measures  analysis 

The  Repeated  Measures  analysis  is  a  univariate  and  multivariate  ANOVA  and  assumes  that: 
at  least  one  factor  consists  of  multiple  measurements  taken  on  the  same  subject;  repeated 
measures  are  independent  across  subjects;  and  assumes  a  normal  multivariate  distribution 
with  a  common  covariance  matrix.  A  multivariate  normal  distribution  for  data  and  residuals 
could  be  checked  out  by  processing  the  data  through  PROC  UNIVARIATE  and  testing  for 
normality.  A  common  or  symmetrical  covariance  matrix  is  assumed  when  there  is  compound 
symmetry,  correlation  between  all  pairs  in  the  matrix  are  similar  and  equal.  This  is  called 
auto  correlation.  There  is,  however,  a  trend  in  animal  data  where  partial  correlations  are 
higher  when  time  intervals  are  closer  together  and  correlations  decrease  as  time  intervals 
widen. 

SAS  Code 
DATA  RPTD; 

INPUT  ANIMAL  TREAT  $G6P1  G6P2  G6P3  G6P4; 
CARDS; 


1 

F 

14 

12 

16 

11 

2 

F 

12 

16 

19 

20 

3 

F 

12 

15 

12 

11 

11 

C 

18 

17 

16 

20 

12 

C 

23 

26 

21 

22 

Note:  The  data  appears  in  rows  rather  than  columns. 
PROC  PRINT; 

TITLE  REPEATED  MEASURES  ANOVA'; 
RUN; 

PROC  GLM  DATA  =  RPTD; 
CLASS  TREAT; 

MODEL  G6P1  G6P2  G6P3  G6P4  =  TREAT; 
REPEATED  PERIOD  4  C0NTRAST(1)/SH0RT  PRINTM 
PRINTE  SUMMARY; 

RUN; 
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To  perform  both  the  univariate  repeated  measures  and  multivariate  analyses  in  a  single 
procedure  the  MODEL  statement  is  set  up  with  multiple  dependent  variables  (G€P1  -  G6P4)  to 
reflect  the  between  animal  design  and  the  REPEATED  statement  links  the  between  and  within 
subject  effects  ie.  eflFects  above  and  below  the  split.  The  resulting  SAS  print  out  will  provide 
univariate  analyses  of  variance  within  each  period  and  since  this  example  has  4  periods  there 
will  be  4,  one  Way  ANOVA'S,  the  Repeated  Measures  ANOVA  with  tests  of  hypothesis  for  between 
and  within  subject  factors. 

Table  2  SA.S  Print  out  TANQVA  onlv)  from  a  Repeated  Measures  ANOVA 


REPEATED  MEASURES  ANOVA 
GLM  PROCEDURE 

TESTS  OF  HYPOTHESES  FOR  BETWEEN  SUBJECTS  EFFECTS    -    Above  the  Split* 


SOURCE 

DF 

TYPE  III  SS 

MS 

F  VALUE 

PR>F 

TREAT 

1 

500.5 

500.5 

18.6 

0.001 

ADJUSTED  PR>F 

ERROR 

10 

268.0 

26.8 

G-G 

H-F 

PERIOD 

3 

8.2 

2.7 

0.77 

0.51 

0.48 

0.51  Below* 

PERIOD*TREAT  3 

12.7 

4.2 

1.19 

0.32 

0.32 

0.32  the 

ERROR(PERIOD)  30 

106.7 

3.5 

split 

G- 

■G 

EPSILON 

=  0.73 

H- 

-F 

EPSILON 

=  1.04 

*  These  comments  will  not  appear  in  the  SAS  output. 


A  comparison  with  the  values  in  table  1  shows  that  the  mean  squares,  F  values  and  P>F  are 
identical.  Therefore,  the  same  ANOVA  table  with  effects  above  and  below  the  split  are  obtainable 
either  through  a  split  plot  or  repeated  measures  analysis. 

There  is  one  rather  serious  limitation  using  the  REPEATED  statement.  Although  both  the  split 
plot  and  repeated  measures  can  derive  Means  for  TREAT  and  PERIOD  it  is  often  of  interest  to 
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obtain  Means  for  the  TREAT*PERIOD  interaction.  If  the  differences  between  the  treatments  are 
the  same  across  periods,  then  the  interaction  between  TREAT  and  PERIOD  will  not  be 
significant.  However,  if  this  interaction  is  significant  (P<0.05)  it  is  necessary  to  know  when  the 
difference(s)  occur.  Unfortunately,  the  TREAT*PERIOD  interaction  means  cannot  be  obtained 
in  a  repeated  measures  analysis  (using  a  REPEATED  statement)  and  thus  one  may  have  to 
revert  to  a  split  plot  type  of  analysis  to  obtain  interaction  means.  The  repeated  measures 
analysis  provides  partial  correlation  coefficients  among  the  repeated  measurements 
(G6P1-G6P4);  table  3.  One  could  use  this  to  see  if  auto  correlation  exists  in  the  matrix.  There  is 
often  a  tendency  for  the  correlation  to  decrease  as  the  time  interval  widens,  which  often  results  in 
the  violation  of  the  H-F/sphericity  condition  (to  be  discussed  later). 

The  PRINTM  option  prints  the  transformation  matrices  that  define  the  contrasts,  while  the 
PRINTE  option  instructs  the  SAS  System  to  print  the  error  sums  of  squares  and  cross  products 
matrix  associated  with  the  repeated  measurements  G6P1  to  G6P4,  as  well  as  the  sphericity  tests 
for  each  transformed  variable  that  are  orthogonal.  The  short  option  prints  the  multivariate  test 
criteria  and  associated  F  tests  in  a  condensed  form.  (See  SAS-Users  Guide  version  5  Ed  or 
SAS/STAT  6.03  for  details). 

Table  3  Partial  correlation  coefficients  from  the  error  SS  &  CP  matrix 
in  a  reneated  measures  analysis  (modified) 

PARTIAL  CORRE.  COEFF.  FROM  THE  ERROR  SS  &  CP  MATRIX/PRO B.lRl 


DF=9 

G6P1 

G6P2 

G6P3 

G6P4 

G6P1 

1.0 

0.71 

0.60 

0.45 

G6P2 

0.71 

1.0 

0.66 

0.55 

G6P3 

0.60 

0.66 

1.0 

0.79 

G6P4 

0.45 

0.55 

0.79 

1.0 

The  univariate  repeated  measures  =  standard  split  plot  analysis  provided  certain  conditions  are 
met.  The  first  is  of  compound  symmetry  ie.  when  correlations  between  all  pairs  of  repeated 
variables  is  equal.  This  condition  is  sometimes  stronger  than  required  and  is  called  the  Huynh  - 
Feldt  (H-F)  condition  in  SAS.  The  H-F  condition  is  met  only  if  all  possible  pairs  of  repeated 
measurements  have  equal  variances.  The  second  requirement  is  of  sphericity  ie.  where  any  set 
of  orthogonal  contrasts  have  equal  variances  and  zero  covariance.  The  validity  of  H-F  can  be 
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determined  by  a  Chi-Square  test  based  on  Mauchly's  sphericity  test  for  Othogonal  contrasts 
(Mauchley  1940,  Pendergast  and  Littel  1988).  If  the  sphericity  test  fails,  the  H-F  condition  is  not 
valid.  Inclusion  of  the  PRINTE  option  generates  a  Chi-square  value  for  Mauchley's  sphericity 
criterion.  If  the  X2  value  has  a  probability  <0.05  one  would  reject  the  null  hypothesis  of  sphericity 
and  conclude  that  the  H-F  and  sphericity  criteria  are  violated.  If  this  be  the  case  the  univariate 
analysis  of  variance  is  not  correct  for  the  within  subject  effects  of  PERIOD  and 
TREAT*PERIOD.  If  on  the  other  hand  the  sphericity  criterion  is  not  violated,  (acceptance  of  ho) 
then  a  split  plot  type  of  analysis  can  be  performed,  and  TREAT*PERIOD  means  compared  by  the 
usual  procedures. 

In  our  example,  Mauchley's  sphericity  for  orthogonal  contrasts  was  P=0.47,  thus  accepting  the 
null  hjrpothesis  of  sphericity.  As  such,  all  univariate  F  tests  are  valid,  and  there  is  no  need  to  use 
the  multivariate  or  adjusted  G-G,  H-F  tests. 

Univariate  and  Multivariate  Tests 

In  addition,  the  SAS  program  provides  multivariate  F  tests  for  PERIOD  and  TREAT*PERIOD 
interaction  (Table  4).  The  multivariate  tests  do  not  require  the  H-F  condition  or  sphericity  to 
hold  and  can  be  used  to  test  period  (time)  and  interaction  effects  in  the  split.  The  SAS  System 
provides  Wilk's  Lambda,  Pillai's  Trace,  Holelling-Lawley  Trace  &  Roy's  Maximum  Root  tests 
(all  multivariate  tests).  The  exact  F  tests  and  P>F  value  is  usually  given.  In  our  example  the  P 
values  are  0.72  for  PERIOD  and  0.19  for  TREAT*PERIOD  interaction  respectively,  which  are 
not  significant.  Furthermore,  in  our  example,  H-F  and  sphericity  is  not  violated  and  the 
univariate  F  test  probabilities  for  PERIOD  and  TREAT*PERIOD  are  0.51  and  0.32  respectively, 
all  "Being  non-significant.  It  could  thus  be  concluded  that  the  univariate  F  tests  are  therefore 
valid  for  the  within  subject  or  below  the  split  effects  in  our  example. 

There  are  times,  however,  when  the  multivariate  and  univariate  F  tests  do  not  match,  especially 
if  sphericity  H-F  condition  is  violated.  When  this  occurs,  one  can  either  accept  the  multivariate 
F  test  (which  usually  lacks  power)  or  use  the  a4justed  F  tests,  Greenhouse-Geisser  (G-G)  or 
Huynh-Feldt  (H-F)  tests.  These  two  tests  are  slightly  more  conservative  than  the  exact 
univariate  F  tests.  If  the  univariate  F  test  is  used  under  conditions  where  the  H-F  condition  is 
violated,  the  test  is  too  liberal  for  the  within  subject  effects  (PERIOD  &  TREAT*PERIOD),  and 
therefore  one  may  encounter  a  Type  I  error,  i.e.  rejection  of  the  null  hypothesis  when  in  fact  it  is 
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true. 


Table  4  Multivariate  tests  for  period  and  period  X  treatment  interaction 

(Modified) 


REPEATED  MEASURES  ANOVA 
GLM  PROCEDURE 

H  =  TYPE  III  SS  &  CP  MATRK  FOR  PERIOD: 

PERIOD.N  REPRESENTS  THE  CONTRAST  BETWEEN  THE  NTH  LEVEL  OF 

PERIOD  AND  THE  1ST 

DF=1  PERI0D.2  PERI0D.3  PERI0D.4 

PERI0D.2  4.08  4.66  8.16 

PERI0D.3  4.66  5.33  9.33 

PERI0D.4  8.16  9.33  16.33 


MANOVA  TEST  CRITERIA  AND  EXACT  F  STATISTICS  FOR  THE  HYPOTHESIS  OF 


NO  PERIOD  EFFECT.    H  = 

TYPE  III 

SS  &  CP 

MATRIX  FOR; 

PERIOD 

E  =  ERROR  SS  &  CP  MATRIX 

S  =  IM  =  0.5  N  =  3 

STATISTIC 

VALUE 

F 

NUM  DF 

DEN  DF 

PR>F 

WILKS*  LAMBDA 

0.85 

0.45 

3 

8 

0.72 

PILLAI'S  TRACE 

0.14 

0.45 

3 

8 

0.72 

HOTELLING-LAWLEY  TRACE 

0.16 

0.45 

3 

8 

0.72 

ROY'S  GREATEST  ROOT 

0.16 

0.45 

3 

8 

0.72 

MANOVA  TEST  CRITERIA  FOR  EXACT  F  STATISTICS  FOR  THE  HYPOTHESIS  OF 
NO  PERIOD*TREAT  EFFECT  H  =  TYPE  III  SS  &  CP  MATRIX  FOR  TREAT*PERIOD 
E  =  ERROR  SS  &  CP  MATRIX 


PR>F 


STATISTIC  VALUE  F  NUM  DF  DEN  DF 

WILKS'  LAMBDA  0.56  2.01        3  8  0.19 

PILLAI'S  TRACE  0.43  2.01        3  8  0.19 

HOTELLING-LAWLEY  TRACE  0.75  2.01        3  8  0.19 

ROY'S  GREATEST  ROOT  0.75  2.01        3  8  0.19 
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In  the  SAS  System,  both  the  G-G  and  H-F  estimates  of  Box's  E  are  available  through  PROC 
ANOVA  and  PROC  GLM,  providing  the  Probabilities  are  calculated  on  the  adjusted  degrees  of 
freedom.  Violation  of  the  H-F  condition  does  not  affect  the  between  subject  (TREAT)  effect 

The  repeated  measures  ANOVA  usually  handles  balanced  data  and  any  animal/subject  which 
does  not  have  a  complete  set  of  data  (repeated  measurements)  is  not  included  in  the  analysis. 
However  Milliken  and  Johnson  (1983)  and  Pareja  (1990)  provide  details  of  the  procedure  and  SAS 
code  to  handle  unbalanced  data.  Often  time,  the  sphericity  condition  can  be  violated  by  having  a 
large  number  of  repeated  measurements  on  the  same  animal.  Some  restriction  on  the  number  of 
repeated  measurements  may  be  helpful  in  conforming  to  sphericity  criteria. 

In  conclusion,  experiments  where  repeated  measurements  are  taken  on  the  same  subject,  can  be 
analysed  either  as  a  imivariate  (repeated  or  split  plot)  or  multivariate  (MANOVA)  type.  If 
H-F/Sphericity  is  not  violated,  the  univariate  approach  can  be  used.  Analysis  of  the  repeated 
measures  design  as  a  split  plot  in  time  is  particularly  useful  as  interaction  means  for  within 
subject  effects  can  be  computed.  Furthermore,  a  split  plot  in  time  will  handle  unbalanced 
designs  and  generate  least  squares  estimates,  provided  PROC  GLM  is  used.  If,  on  the  other 
hand,  the  H-F  condition  is  in  question,  either  a  Multivariate  approach  (Wilk's  Lambda)  or 
adjusted  Univariate  (G-G  &  H-F)  tests  should  be  used.  It  is  recommended  that  Univariate  tests 
be  used  when  appropriate,  as  it  is  more  powerful  than  the  Multivariate  tests.  All  these  conditions 
are,  however,  only  applicable  to  the  within  subject  (below  the  split)  effects. 
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STATISTICAL  ISSUES  IN  INSECT  POPULATION  RESEARCH 
G.  Bruce  Schaalje 
Agriculture  Canada  Research  Station 
Lethbridge,  Alberta 

Introduction 

Research  involving  insect  populations  as  the  experimental  units  presents  the  statistician  with 
many  interesting  challenges.  In  this  paper  I  discuss  some  of  the  characteristics  of  insect 
populations  that  must  be  taken  into  consideration  when  designing  experiments  and  analysing 
data,  and  review  some  of  the  statistical  methods  that  have  been  devised  for  work  with  insect 
populations. 

Characteristics  of  Insect  Populations 

Insects  are  Small 

While  this  statement  is  obvious,  its  implications  for  statistical  work  are  important.  One  of  the 
biggest  challenges  in  insect  population  research  is  in  estimating  the  size  of  insect  populations,  but 
because  insects  are  small  they  are  hard  to  find,  count,  and  identify  completely  as  to  species,  sex, 
and  growth  stage.  Furthermore,  their  size  often  renders  them  fragile  and  vulnerable  to 
microenvironmental  conditions.  As  a  result,  the  estimation  of  population  size  usually  involves 
shortcut  methods  such  as  incomplete  counts,  indirect  assessments,  lures,  mechanical  collection 
devices  (Southwood  1978),  and  data  aggregated  over  growth  stages,  sexes,  or  species  (Brillinger  et 
al.  1980).  Most  of  these  methods  require  complicated  statistical  models  to  obtain  estimates  and 
standard  errors,  and  for  calibrating  collection  devices  (Mcdonald  and  Manly  1989).  Data  from 
theselnethods  are  often  subject  to  large  unexplained  fluctuations  and  lots  of  variability. 

Insects  are  Poikilothermic 

Insect  development  and  activity  is  dependent  on  temperature,  and  thus  temperatures  must  be 
recorded  in  experiments  involving  population  change.  This  limits  the  use  of  "replication  in 
time"  in  experiments  using  natural  populations  (particularly  if  they  are  univoltine),  and  often 
necessitates  the  incorporation  of  mathematical  models  for  the  relationship  between  temperature 
and  development  into  statistical  analyses  (Curry  and  Feldman  1987). 
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Insects  are  Not  Stationary 

Insects  migrate  in  and  out  of  the  study  area  during  the  course  of  any  experiment  (Kuno  1991),  and 
often  change  their  location  in  plants  at  different  times  of  day.  This  movement  has  to  be  taken  into 
account  either  in  the  design  stage  of  the  experiments,  for  example  by  using  buffer  areas  around  the 
populations  and  sampling  at  specific  times  of  the  day,  or  in  the  analysis  stage  by  adjusting 
mathematically  for  migration.  This  may  necessitate  the  collection  of  covariates  to  indicate 
various  aspects  of  migration. 

The  Spatial  Distribution  of  Insects  is  Usually  Not  Random 

The  spatial  distribution  of  insects  is  generally  aggregated  to  some  degree  (Taylor  1984)  so  that 
simple  models  for  sampling  or  population  size  estimation  based  on  the  Poisson  distribution 
usually  do  not  give  good  results  (Taylor  1987).  Also,  insect  distributions  on  agricultural  fields 
usually  display  edge  effects  which  must  be  taken  into  account  in  designing  experiments. 

The  Appropriate  Spatial  Unit  for  Sampling  may  not  be  Obvious 

The  degree  of  aggregation  of  an  insect  population  may  be  an  artifact  of  the  quadrat  size  used  to 
sample  the  population  and  hence  may  have  little  meaning  unless  the  quadrat  size  is  "natural"  in 
some  sense.  A  standard  area  or  volume  may  not  be  appropriate  because  of  varying  numbers  of 
plants,  refugia,  etc.  (Regniere  et  al.  1989).  A  lot  of  thought  has  to  be  put  into  the  choice  of  a  sampling 
unit. 

Insects  Populations  have  an  Age  Structure 

This  has  to  be  kept  in  mind  in  insect  population  research  because  the  various  growth  stages  of  an 
insecThave  unique  physiologies,  behaviour,catchabilities,  survivorship,  etc.  The  different  growth 
stages  will  react  to  treatments  differently  and  thus  the  age  structure  of  the  population  has  to  be 
either  controlled  in  the  experiment  or  observed  and  adjusted  for  mathematically  (Schaalje  et  al. 
1989). 

Insects  Develop  Fast 

As  a  result  of  this,  experiments  on  insect  populations  usually  have  to  involve  repeated 
measurements,  pretreatment  assessments  of  the  populations,  and  control  populations.  Often 
apparent  treatment  effects  have  to  be  adjusted  for  population  changes  (mortality,  reproduction, 
development)  that  would  have  occurred  naturally  during  the  experiment  (Abbott  1925). 
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The  Population  Structure.  Habitat,  and  Treatments  Effects  are  Dynamic 
All  three  of  these  components  of  an  insect  population  experiment  have  independent  dynamics 
which  may  interact  in  complicated  ways.  The  analysis  and  interpretation  of  data  from  these 
experiments  often  have  to  be  based  on  a  detailed  population  model  as  well  as  models  for  the 
environment  and  the  treatment  (Schaalje  et  al.  1989). 

Statistical  Metbods  for  Insect  Population  Researdi 

Sampling  Methods 

Many  techniques  of  finite  population  sampling  are  important  in  estimating  the  size  of  insect 
populations.  Stratified  sampling,  double  sampling,  and  multistage  cluster  sampling  (Southwood 
1978)  are  all  useful  in  sampling  insect  populations. 

Binomial  sampling  (Schaalje  et  al.  1991,  Nyrop  and  Binns  1991)  by  which  population  density  is 
estimated  using  presence-absence  data  from  a  random  sample  of  units  is  very  important  in 
sampling  small,  highly  aggregated  insects.  Measurement  error  models  (Fuller  1987)  are 
necessary  in  getting  unbiased  prediction  equations  for  binomial  sampling,  and  for  calculating 
appropriate  estimates  of  the  standard  error  of  prediction. 

Sequential  sampling  (Nyrop  and  Binns  1991,  Kuno  1991)  has  been  applied  extensively  to  pest 
management  applications  so  that  as  few  samples  as  possible  can  be  taken  to  determine  whether  the 
population  has  reached  an  economic  threshold. 

Mart"->ecapture  methods  in  which  the  population  is  assumed  to  be  open  to  migration,  and  which 
involve  multiple  releases  but  a  single  recapture,  have  recently  been  found  to  be  useful  in 
estimating  the  population  size  and  survival  for  certain  species  of  insects  (Bumham  1989,  Lysyk 
andAxtell  1986). 

McDonald  and  Manly  (1989)  discuss  how  biased  sampling  procedures  arise  in  insect  population 
sampling  and  suggest  methods  for  their  calibration.  Finally,  ratios  often  arise  when  using  data 
from  sampling  devices  to  estimate  population  size,  and  Buonaccorsi  and  Liebhold  (1989)  discuss 
the  unbiased  estimation  of  such  ratios. 


67 


Spatial  Analysis 

Various  statistical  models  for  the  spatial  distribution  of  insects  (Southwood  1978)  have  been 
suggested  and  used,  most  notably  the  negative  binomial  distribution.  In  addition,  indices  of 
aggregation  have  been  developed  and  discussed  (Hurlbert  1990).  General  models  for  the 
relationship  between  the  mean  and  the  variance  of  populations  for  a  given  species  (Taylor  1984, 
Iwao  1976,  Kuno  1991)  are  useful  in  characterizing  insect  species  and  predicting  their  spatial 
distribution  at  a  given  density.  Binns  (1986)  discusses  the  relationship  between  Taylor's  variance- 
mean  model  and  the  negative  binomial  distribution.  The  relatively  new  field  of  geostatistics  has 
been  applied  to  insect  populations  to  provide  the  ability  to  predict  population  densities  at  any 
particular  location  in  a  field  (Schotzko  and  O'Keeffe  1989),  and  on  a  larger  scale  geographical 
information  systems  are  providing  similar  capabilities  (Johnson  1989). 

Stage-Freouencv  and  Related  Demographic  Analvses 

Several  methods  have  been  developed  for  estimating  stage-specific  development  times,  mortality 
rates,  and  reproductive  rates  of  insects  fi-om  a  sequence  of  cross- sectional  samples  of  insect 
populations  (Manly  1989).  These  methods  differ  as  to  the  type  of  data  collected  (non-overlapping 
cohorts,  multiple  cohorts,  etc.),  the  type  of  information  desired,  and  the  assumptions  of  the  models 
upon  which  the  methods  are  based.  In  addition,  "key-factor  analysis"  (Varley  and  Gradwell  1960, 
Manly  1989)  provides  a  method  for  analysing  stage-fi-equency  data  collected  over  several 
generations  and  determining  the  relative  contribution  to  population  dynamics  of  mortality  in  the 
various  growth  stages.  Finally,  Carey  (1989)  discusses  how  traditional  demographic  methods  can 
be  applied  to  the  study  of  insect  population  dynamics. 

Stochastic  Modelling  of  Population  Dynamics 

Much  statistical  analysis  of  insect  population  data  involves  fitting  a  stochastic  model  of  population 
dynamics  to  the  data.  Brillinger  et  al.  (1980)  developed  a  stochastic  difference  equation  to  model 
aggregate  insect  data  and  investigate  density-dependent  mortality.  Blough  (1989)  used  time 
series  methodology  in  connection  with  a  difference  equation  model  in  his  analysis  of  insect 
population  changes  due  to  a  pesticide.  Dennis  (1989)  discussed  the  use  of  stochastic  differential 
equation  models  for  insect  populations.  Schaalje  and  van  der  Vaart  (1989)  reviewed  stage-specific 
population  models  which  allow  for  variability  in  developmental  rates,  and  Schaalje  et  al.  (1989) 
applied  such  a  model  to  the  analysis  of  field  data  on  a  population  sprayed  with  a  pesticide.  Curry 
and  Feldman  (1987)  discuss  several  issues  and  strategies  important  in  modelling  and  analysing 
insect  populations. 
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EXPERIMENTAL  DESIGN:  BASIS  FOR  SOUND  RESEARCH  METHODS 

L.  Zack  Florence 
Animal  Sciences  Division,  Biometrics  Section 
Alberta  Environmental  Centre 
Vegreville,  Alberta  TOB  4L0 

InlzDdviction 

Well-planned  experiments  increasingly  form  the  basis  for  cost  effective  research.  Few  people 
today  would  argue  that  budgeting  time  and  money  does  not  largely  dictate  whether  an  experiment 
will  be  attempted.  For  those  using  humans  or  other  animals  as  experimental  subjects,  the  care, 
welfare  and  ethical  use  of  research  subjects  is  the  first  consideration:  choosing  sample  sizes  takes 
on  a  new  meaning  and  has  drawn  renewed  attention  to  power  analysis  during  project  planning 
(  Mann  et  al.  1991). 

Most  of  us  encountered  our  first  meaningful  experimental  design  experience  at  work  or  in 
graduate  school.  This  was  also  our  first  introduction  to  planning  research  in  committee.  It 
brings  back  some  questionably  fond  memories  for  many  of  us  and  will  be  left  at  that.  The  point  to 
be  made  here  is  that  few  people  do  research  alone;  most  of  us  must  work  in  an  environment  where, 
in  order  to  meet  our  project  objectives,  we  must  help  other  scientists  meet  theirs.  This  paper  will 
discuss  a  few  of  the  topics  that  must  be  considered  when  planning  a  co-operative  experiment.  It 
will  be  assumed  that  even  if  you  are  not  involved  in  the  committee  approach  to  doing  research  that 
you  are  at  least  in  contact  with  a  statistician  or  someone  in  whom  you  may  confide  (a  person(s) 
who  helps  choose  the  proper  model  and  the  appropriate  design,  and  makes  inferences  based  upon 
the  results  of  analyses).  This  paper  is  a  short  discussion  of  the  process;  several  good  texts  such  as 
Box  et  al.  (1978),  Cox  (1958)  and  Anderson  and  McLean  (1974)  may  be  consulted  for  more  in-depth 
discussions  and  underlying  theory.  The  one  mistake  all  of  us  hope  to  avoid  when  planning 
experiments  is  what  A.W.  Kimball  (Kimball  1957,  cited  by  Box  et  al)  called,  "error  of  the  third 
kind",  that  is,  obtaining  the  right  answer  to  the  wrong  problem. 
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Ordered  List  of  Steps  in  Planning  an  Experiment 

Anderson  and  McLean  (1974)  prepared  a  list  of  12  steps  that  should  be  followed  when  planning  an 
experiment.  I  have  added  my  comments  and  examples  to  their  list  (as  follows).  When  moving 
through  the  list,  keep  in  mind  that  it  is  implied  that  a  statistician  is  involved  in  this  process  from 
the  beginning. 

1.  Reco^ition  that  a  problem  exists 

-  The  problem  is  usually,  but  not  always,  apparent  and,  if  a  group  or  committee  is  involved, 
it  is  important  to  have  consensus. 

2.  Formulation  of  the  oroblem. 

-  Uninhibited  discussion  among  participants  in  a  group  situation  is  healthy.  If  it  is  not  a 
group  project,  it  is  wise  to  seek  the  experience  of  others  in  the  same  area  of  research.  In 
either  case,  a  collaborative  effort  will  make  it  easier  to  come  to  a  decision  about  the  most 
likely  problems  or  subject  areas  requiring  research  effort.  Do  not  be  surprised  if  it  takes 
more  effort  than  you  expected  to  identify  the  objectives  of  an  experiment. 

3.  Agreeing  on  factors  and  levels  to  be  used  in  the  experiment 

-  In  step  2,  you  identified  a  probable  cause(s),  and  you  are  now  ready  to  identify  treatments 
or  factors  (independent  variables)  that  you  can  vary  (levels)  and  measure  (dependent 
variables  such  as  response,  jaeld  or  product ).  Continuous  independent  treatment  factors 
will  require  different  assumptions  than  discrete  discontinuous  ones. 

4.  Specifying  the  variables  to  be  measured. 

-  Quantitative  responses  by  dependent  variables  are  usually  more  apparent  than  are 
qualitative  ones;  the  latter  responses,  are  not  metric  and  are  at  best  ordinal.  The 
decision  to  use  one  or  the  other  is  very  important. 

5.  Definition  of  the  inference  space  for  the  problem 

-  This  decision  determines  how  far  you  may  legitimately  extend  the  results  of  the 
study;  for  example,  if  an  experiment  is  bounded  by  20  and  60  degrees  C,  you  are  limited  to 
this  space  and  the  inferences  are  bound  by  choosing  the  proper  error  terms  when  fitting  the 
model. 

6.  Random  selection  of  the  experimental  units 

-  The  experimental  unit  is  the  material,  area,  time,  plot,  pot  or  whatever,  which  will 
receive  the  treatment.   In  animal  experiments,  the  experimental  unit  may  be  a  pen  of 
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animals,  with  the  same  randomly  assigned  treatment  applied  to  all  animals  in  the 
pen. 

-  The  experimental  unit  should  be  representative  of  the  inference  space,  that  is,  growth 
trials  done  in  a  greenhouse  may  have  little  relationship  to  the  field  environment. 

-  The  number  of  experimental  units  determines  the  standard  error  or  precision  which 
will  be  associated  with  the  inferences  you  wish  to  make  from  your  results;  keeping  the 
standard  error  low  may  require  blocking  (e.g.,  spatially,  temporally,  by  treatment)  to 
avoid  systematic  bias. 

7.  Assignment  of  treatments  to  the  experimental  units 

-  Each  experimental  unit  (plot,  cow,  beaker,  petri  plate)  should  have  equal  chance  of 
receiving  each  treatment. 

-  Randomization  (of  error)  is  the  basis  for  the  design  of  the  experiment,  for  example,  in  a 
randomized  complete  block,  split-plot  experiment,  main  plots  are  randomized  within 
blocks,  and  sub-plots  are  randomized  within  main  plots,  within  blocks. 

8.  Outline  of  the  analvsis  corresponding  to  the  design  before  the  data  are  taken 

-  Quoting  form  Anderson  and  McLean  (1974)  pp.  87-88:  "At  this  point,  the  statistician  must 
write  down  the  mathematical  model  that  has  evolved  as  a  result  of  the  committee  activity 
in  the  preceding  sections.  This  mathematical  model  will  give  rise  to  the  ANOVA  table. 
The  ANOVA  table  will  now  consist  of  the  degrees  of  fireedom  and  the  expected  mean 
squares  for  each  of  the  specific  factors...".  The  point  made  here  is  that  the  design  and  the 
model  to  which  data  are  to  be  fit  go  hand-in-hand,  and  the  tests  of  interest  are  determined 
before  the  experiment  is  done.  This  point  is  discussed  further  in  the  following  section. 

9.  Collection  of  the  data 

-  Too  many  experiments  begin  here! 

-  Quality  control  and  quality  assurance  are  paramount  at  this  stage;  several  people  may  be 
collecting  data  during  the  course  of  a  project  and  someone  should  be  in  charge  of  regularly 
validating  data  and  collection  methods. 

-  Data  forms  should  be  simple  and  straightforward —  have  a  trial  run  to  make  sure 
everyone  understands  how  the  forms  are  to  be  used. 

10.  Analvsis  of  the  data 

-  In  this  step  you  should  plot  means  and  individual  data  in  scatterplots,  calculate 
descriptive  statistics  and  use  statistical  packages  which  will  test  for  normality  and 
produce  normal  probability  plots. 

-  Ask  yourself:  Are  parametric  assumptions  met,  that  is,  are  variances  equal;  are 
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variances  independent  of  the  means;  are  transformations  needed? 

-  If  you  have  decided  upon  the  correct  model  during  the  design,  applying  the  correct 
analysis  should  be  fsdrly  straightforward  because  you  have  previously  decided  how 
the  data  are  to  be  analyzed  and  which  tests  are  appropriate.  Beware!  Unless  you  specify 
otherwise,  all  statistical  software  packages  assume  all  effects  in  your  model  are  fixed. 
This  assumption  can  lead  to  spurious  conclusions  regarding  tests  for  significance. 

11.  Conclusions 

-  Once  again,  if  this  is  a  group,  interdisciplinary  effort,  you  will  likely  need  to  get  together 
and  discuss  the  results  of  the  analyses.  Experience  suggests  that  this  can  be  a  very 
valuable  exercise:  many  times,  leaving  interpretation  of  results  to  one  person  may  lead  to 
error,  and  worthwhile  information  may  be  overlooked. 

12.  Implementation 

-  Time  now  for  a  "management"  decision:  do  you  alter  a  manufacturing  process  based 
upon  the  results,  publish  the  results  or  wait  until  additional  study  is  completed;  can  you 
implement  a  technology  transfer  program  to  agricultural  producers  based  on  your  study's 
conclusion?  The  ease  with  which  these  decisions  can  be  made  will  be  contingent  upon 
how  closely  you  have  adhered  to  a  process  like  the  one  discussed  here. 

-  If  an  experiment  has  been  based  on  prototypes  or  bench  techniques,  it  will  be  time  to  think 
of  applying  what  you  have  learned  to  "scale  up",  which  for  example,  to  a  crop  scientist 
means  doing  field  trials  to  determine  how  well  greenhouse  or  lab  experiments  fit  the  real 
world. 

From  Design  to  Model  to  Analysis 

Ideally^  the  statistical  model  to  which  we  hope  to  fit  the  data  should  have  been  determined  by  the 
time  you  have  reached  step  number  8.  This  is  not  always  the  case  and  too  often  the  very  close 
linkage  between  the  experimental  design  and  the  model,  and  how  data  will  be  analyzed  and 
average  differences  tested,  are  not  realized.  A  simple  illustration  can  serve  to  demonstrate  the 
associations. 

Let  us  assume  you  have  identified  two  experimental  factors,  a  and     which  you  wish  to  vary 
(Figure  1).  Treatments  composed  of  factorial  combinations  of  the  levels  (i,  j,  respectively  )  of  a 
and  p  are  randomly  assigned  to  experimental  units,  which  are  arranged  in  a  completely 
randomized  design.  You  want  to  measure  the  response  (Y)  due  to  main  effects  of  a  and  p  and  their 
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FIGURE  1.  Two  factor  (a,  P),   full  factorial  experiment  showing  changes 
among  expected  mean  squares  (EMS)  conditional  upon  assumptions  about  a 
and  p. 


The  Model 

=u  +  ai  +  pj  +  apq  +  £(ij)k 
If  the  Experiment  calls  for  both  a  and  p  factors  to  be  fbced: 


Source  EMS 

ai  a2+pne(a) 

pj  a2  +an0(p) 

apij  a2+ne(ap) 

e(ij)k  cj2 

If  the  Experiment  calls  for  both  a  and  P  to  be  random: 

Source  EMS 


ai  a2  +  n  c^a'^  +  bn  a2(^ 

p  j  a2  +  n  a2(xp  +  an  G2p 

apij  a2  +  na2(xp 

e(ij)k 

If  ^e  Experiment  calls  for  a  to  be  Random  and  p  fixed: 

Source  EMS 


ai  a2  +  bn  c^q^ 

pj  a2  +  n  a^Q,^  +  an0(P) 

apij  a2  +  na2(xp 

e(ij)k 
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interaction,  o^.  How  these  effects  in  your  model  are  to  be  tested  for  significance  depends  upon  the 
expected  mean  squares  [EMS;  see  Anderson  and  McLean  (1974)  for  a  description  of  the  EMS 
algorithm]. 

Referring  to  Figure  1,  assume  you  decide  that  factors  a  and  j3  are  "fixed",that  is  ,  you  wish  to  set 
each  factor's  levels  at  prescribed  values  and  these  explicitly  define  the  extent  to  which  you  may 
make  inferences  within  the  design  space.  Recall  that  number  10  stated  that  all  commercial, 
statistical  software  makes  the  assumption  that  all  effects  in  a  model  are  fixed.  The  EMS  reflect 
this  assumption  in  Figure  1:  to  do  an  F-test  of  each  term  in  the  fixed  model,  each  mean  square 
estimate  should  be  divided  by  the  mean  square  error ,  e.  Note  that  each  effect  attributed  to  a  and  p 
can  then  be  uniquely  estimated  by  this  ratio  of  effect  to  error. 

Note,  however,  if  during  the  planning  of  our  experiment  we  wished  to  make  inferences  not  to  a 
very  limited  number  of  explicit  levels  (and  fixed  contributions  of  a  and  P  to  the  overall  mean 
effects),  but  to  a  broader,  more  universal  variation  in  all  a  and  P  (called  a  random  effects  model). 
The  EMS  would  take  on  a  different  look(  note  that  fixed  treatment  effects,9,  are  now  replaced  by  c^, 
denoting  variance  effect).  Now  there  are  two  error  terms  we  must  use  to  test  for  treatment  effects. 
The  op  interaction  is  used  as  the  error  term  to  test  main  effects  (a  and  P)  because  it  shares  the  same 
EMS  components,  except  for  that  contributed  only  by  the  a  or  p  treatment  effect.  The  random  error 
mean  square,e,  is  now  used  only  to  test  the  significance  of  the  aP  interaction. 

In  the  third,  and  final  combination,  called  a  "mixed"  model,  there  are  two  factors,  one  fixed  and 
one  random  (Figure  1).  After  calculating  the  EMS,  we  are  able  to  see  that  the  random  error,  e,  can 
be  used  to  estimate  the  effect  of  a  and  the  oP  interaction,  but  the  oP  interaction  is  appropriate  to  test 
the  efleit  due  to  p. 

Obviously,  from  this  simple  illustration,  decisions  need  to  be  made  up-front  as  to  how  well  the 
design  and  model  conform  and  whether  our  research  objective(s)  can  be  met  by  the  appropriate 
tests  of  terms  in  the  model.  In  addition  to  the  source  already  mentioned,  a  good  explanation  of 
these  topics  can  be  found  in  Snedecor  and  Cochran(1967,  ch.  12). 
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Summary 


This  paper  has  provided  a  simple  schematic  ouUine  of  12  steps  that  ought  to  be  considered  during 
the  design  and  analysis  of  an  experiment.  If  more  effort  is  expended  in  the  early  states  of  a 
research  project,  the  later  stages  —  defining  the  model,  fitting  the  response  data,  testing  for 
treatment  effects,  and  publicly  presenting  the  results  —  can  be  done  with  greater  confidence. 
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PARAMETRIC  ASSXJMPTIONS 
Bob  Hardin 
Department  of  Animal  Science 
University  of  Alberta,  Edmonton,  Alberta 

Abstract 

Statistical  hypothesis  testing  of  the  linear  model  is  based  on  assumptions  that  the  data  are 
normally  distributed.  In  many  situations,  both  in  teaching  and  the  real  world,  this  is  simply 
assumed  to  be  true.  A  problem  is  how  to  incorporate  into  a  graduate  biometrics  course  the  routine 
testing  of  these  assumptions.  The  approach  presently  being  incorporated  into  the  course  is  the 
computation  of  residuals  from  the  predicted  linear  model  and  the  use  of  these  residuals  in  the 
computation  of  sample  statistics  and  normal  probability  plots.  The  GLM,  UNIVARIATE  and 
RANK  procedures  of  SAS  will  be  used  to  perform  the  computations.  Examples  will  be  presented  to 
illustrate  the  approach. 
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STATISTICAL  PROBLEMS  IN  COMPLIANCE  ASSESSMENT 
Albert  J.  Liem 
Air  and  Waste  Management  Branch 
Alberta  Environmental  Centre 
Vegreville,  Alberta 


Introduction 

This  note  is  a  condensed  description  of  a  paper  already  published  elsewhere  (Liem  and  Wilson 
1991).  Problem  definition  is  emphasized,  but  details  of  the  solution  have  been  omitted,  with  the  hope 
of  not  compromising  accuracy  or  clarity  for  sake  of  brevity.  The  purpose  of  presenting  this  note  is 
to  introduce  the  subject  the  reader,  who  may  find  it  either  useful  in  other  applications  or  at  least 
academically  interesting. 

Description  of  the  problem 

The  problem  is  quantifying  the  confidence  of  proving  that  regulatory  compliance,  given  by  the 
following  equation,  is  met: 

y,  <  y.  (1) 

where  yr  is  the  measured  value  of  the  regulated  parameter  and  yc  is  the  compliance  level.  Many 
regulatory  standards  can  be  expressed  as  above,  and  assessing  compliance  is  not  a  trivial  problem 
when: 

*the  accuracy  and  precision  of  the  method  for  measuring  y,  Qierein  referred  to  as 
variability)  are  not  known,  and 

'''some  or  all  the  reported  values  are  expressed  as  less  than  detection  limits  or 
nondetectable.i 

Such  is  the  case  with  the  results  of  many  incinerator  test  bums,  as  shown  in  Table  I. 

^  A  further  complicating  factor  is  when  the  detection  limit  is  not  unique,  but  can  be  varied  by 
adopting  different  sampling  or  sample  processing  strategies,  and  hence  it  can  be  made  as  close  as 
possible  to  the  compliance  level;  see  original  paper. 
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In  all  the  cases  shown,  the  reported  values  are  lower  than  the  compliance  levels.  Is  it  valid  to 
interpret  such  evidence  as  a  definitive  proof  that  compliance  is  met?  Our  contention  is  NO.  The 
difference  between  the  reported  values  and  the  compliance  levels  and  the  variability  of  yr  should 
both  be  taken  into  account.  Intuitively,  one  expects  that  the  larger  the  difference  and  the  smaller 
the  variability,  the  more  assured  one  is  that  compliance  is  actually  met. 

Consider  the  accompanying  results  of  a  surrogate  spiking  program  that  was  implemented  to 
address  the  issue  of  variability.  In  the  process  of  measuring  yr,  the  'analyst'  was  given  certain 
quantities  (unknown  to  the  analyst)  of  surrogates  and  requested  to  report  those  quantities.  The 
recovery',  defined  as  ratio  of  the  reported  to  the  actual  values,  ranged  fi-om  10%  to  300%.  Thus,  on 
the  assumption  that  the  surrogates  and  the  regulated  compounds  are  similar,  the  actual  values  of  yr 
could  be  as  high  as  ten  times  the  reported  values  shown  in  Table  I.  Surely  in  Case  C  one  cannot  be 
very  sure'  that  compliance  is  met;  but,  one  can  be  'more  sure'  that  compliance  is  met  in  Case  B. 
As  previously  stated,  the  problem  is,  therefore,  how  to  quantify  the  confidence  of  concluding  that 
compliance  is  met.  It  is  a  statistical  problem. 

Table  I.  Sample  Incinerator  Test  Bum  Results* 


Case 

y. 

y. 

A 

2.6 

46 

4J, 

43 

as 

44 

B 

<44 

3S1 

<26 

no 

<2l 

419 

C 

<14 

26 

<21 

52 

<22 

44 

*  Actual  results  from  the  Alberta  Special  Waste  Treatment  Centre  (ASWTC)  near  Swan  Hills. 
Both  values  are  expressed  in  mg/h  of  emissions  of  regulated  compounds. 

Note  that  it  would  be  conceptually  erroneous  to  use  the  results  in  Case  A  for  computing  confidence 
intervals,  fi-om  which  statistical  inferences  regarding  compliance  being  met  are  drawn.  One 
simple  reason  is  that  the  bias  or  the  accuracy  of  the  method  is  not  known  from  those  results  alone. 
Incidentally,  that  approach  could  not  be  used  when  nondetectable  results  are  obtained,  as  in  Cases 
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B  and  C. 
Solution 
Outline 

There  are  three  elements  in  the  solution: 

*  Obtaining  the  variability  of  the  method  for  measuring  y,.:  In  the  case  of  incinerator  test 
bums,  this  can  be  indirectly  obtained  from  a  surrogate  spiking  program.  Other  cases  may 
require  different  schemes.  The  premise  is  that  the  surrogates  and  the  regelated 

parameters  are,  in  terms  of  variability,  identical. 

♦Treatment  of  nondetectable  results:  From  the  point  of  view  of  the  'regulator',  to  whom 
compliance  must  be  proven,  nondetectable  results  can  be  assigned  the  detection  limit 
values.  Thus,  a  'conservative'  approach  is  used. 

*  Statistical  analysis:  The  premise  is  that  measurement  results  are  distributional.  That 
is,  given  a  value  of  the  regulated  parameter,  repeated  measurement  results  can  be 
described  by  a  probability  density  function.  The  Bayesian  approach  for  hjrpothesis  testing 
can  then  be  used. 

The  implementation  of  a  surrogate  spiking  program  or  other  programs  can  be  quite  involved  and 
hence  will  not  be  described  here.  The  only  aspect  that  will  be  described  is  the  statistical  analysis, 
starting  from  the  point  where  the  variability  of  the  measurements  of  yr  has  been  obtained. 

Statistical  Analysis 

Bayes'  theorem.  Compliance  assessment  can  be  formulated  as  hypothesis  testing.  The  first 
hypothesis  is  Hi:y>yc  -  compliance  is  violated  -  and  the  second  is  the  alternative  H2:y<yc  - 
compliance  is  met  -  where  y  is  the  actual  value  of  the  regulated  parameter. 

Given  that  a  value  of  yr  is  obtained  (or  a  series  of  values  of  yrk,  where  subscript  k  represents  the  kth 
measurement),  what  are  the  probabilities  of  Hi  and  H2  being  true?  Bayes'  theorem  is: 

P[HJy^  «  P\yJH]iP[Hji  >u  (2) 

where  *  P[Hj]  is  the  a  priori  probability  of  Hj  being  true,  the  degree  of  belief  in  Hj  beforfi  the 
evidence  is  gathered, 

*  P[y,yHj],  referred  to  as  a  conditional  probability,  is  the  probability  of  obtaining  yr  if 
indeed  Hj  is  true, 

*  P[Hj/yr]  is  the  a  posteriori  probability  of  Hj  being  true,  the  revised  degree  of  belief  in  Hj 
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after  the  evidence  is  gathered. 
The  degree  of  belief  in  a  hypothesis  is  revised  by  the  evidence  gathered  and  quantitatively 
modified  in  proportion  to  the  probability  of  that  evidence  being  obtained  if  indeed  the  hypothesis  is 
true.  This  revision  can  be  continually  updated  in  a  series  of  measurements,  where  the  a  posteriori 
probability  of  one  set  of  measurements  is  used  as  the  a  priori  value  for  the  next. 

The  proportionahty  constant  in  Eq.  (2)  can  be  eliminated  by  noting  that  Hi  and  H2  are 
complementary,  or  by  using  the  concept  of  odds  R,  defined  by  the  ratio  of  the  two  probabilities.  Eq. 
(2),  written  in  terms  of  odds,  becomes: 

R[C/y;i  =  R\yJC\R[C]  (3) 

where  C  denotes  compliance  being  met,  that  is,  R[Cl=P[H2]/P[Hi]  and  the  conditional  and  a 
posteriori  odds  are  similarly  defined. 

Two  inputs  are  thus  required.  The  first  is  the  a  priori  odds  of  compliance  being  met.  A  value  of  one, 
expressing  no  prior  knowledge  whether  compliance  is  more  or  less  likely  to  be  met,  seems  a 
reasonable  value.  The  second  is  the  conditional  probability,  which  requires  statistical  models  or 
assumptions  and  as  described  below. 

Assumptions.  The  following  assumptions  are  needed,  but  the  actual  'models'  or  equations  used 
can  be  changed  to  suit  the  situation: 

(1)  Probability  density  fiinction  of  obtaining  yr  given  that  the  actual  value  is  y.  A  simple 
function  is  log-normal,  with  constant  variance: 

=  (4) 

where  2  is  the  variance  (derived  firom  surrogate  spiking  results)  and  the  upper  case  Y 
denotes  log-transformed  values. 

(2)  The  density  function  of  the  a  priori  probability  or  odds.  The  simplest  function  is  a 
constant  density. 

Computation  of  conditional  probability.  Consider  the  conditional  probability  of  obtaining  y^.  under 
Hi,  that  is  compliance  is  not  met,  or  y>yc.  This  can  be  readily  computed  for  the  'models'  used  in  the 
above  assumptions.  It  is  simply  the  integral  of  Eq.  (4)  with  respect  to  Y,  from  Y=Yc  to  Y=». 


^y^J  '  Jexp-^^^^^dY  (5) 
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It  can  verified  that 

P\yJH{\  =  *(^^)  =  ♦(Zp  (6) 
o 

where  (Zy)  is  the  'tail  area'  of  the  standard  normal  distribution  (zero  mean  and  unit  variance)  to 
the  left  ofZr=(Yr-Yc)/ . 

A  posteriori  odds  of  compliance  being  met.  In  a  series  of  measurements  the  degree  of  belief  is 
continually  updated  as  more  evidence  is  gathered.  Thus  at  the  end  of  M  sets  of  measurements,  the 
odds  of  compliance  being  met  is: 

RJ-<^/yrM^  =  (ft  ^it^  ^'^ 

where  the  conditional  odds  in  the  j^h  measurement  is: 

As  discussed  previously,  Ri[C]=l  is  a  reasonable  value,  the  values  of  ,  Yc  and  Yr  are  obtained  from 
the  surrogate  spiking  and  test  bum  results,  can  be  found  from  statistical  tables  and  hence  the 
final  a  posteriori  odds  of  compliance  being  met  can  be  readily  computed. 


Results  and  Discussion 

The  method  was  applied  to  a  series  of  test  bums  conducted  at  the  ASWTC.  A  complete  presentation 
would  be  too  lengthy  since  there  were  'complications',  such  as  the  presence  of  two  emission 
sources.  The  condition  for  compliance  can  no  longer  be  expressed  as  given  in  Eq.  (1),  but  is  in  the 
following  form: 

where  subscript  s  denotes  the  s^^  emission  source,  NS  is  the  number  of  emission  sources  and  g  is  the 
weighting  factor  for  emission  source  s. 

For  simplicity,  therefore,  only  selected  cases  will  be  presented,  and  the  data  'variability'  will  be 
qualitatively  expressed  by  the  range  of  surrogate  recovery  values  in  each  emission  source.  The 
results  are  shown  in  Table  II. 
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Table  II.  Summary  of  Results 


Case 

Variability,^  ' 

Test 
Results* 

K 

Conditional 
Odds 

A  posteriori 
Odds 

MS 

CT 

A 

83-120 

71-140 

18 

>2  10^ 

>8  10" 

10 

>2  10^ 

• 

3.6 

2000 

B 

34-290 

36-275 

8* 

2.9 

36 

15* 

3.3 

20* 

3.7 

C 

77-130 

30-330 

1.8* 

1.3 

2.7 

2.5* 

1.5 

2.0* 

1.4 

L 

54-185 

50-200 

0.9 

0.9 

1 

0.4 

0.4 

3.2 

2.7 

Notes:  t  Range  of  sunogaie  recovery  vtlues,  MS  and  CT  »re  the  two  emurion  jources;  t  Results  axe  erpressed  u  ratio  of 
compliance  level  to  reponed  value,  thus  values  of  <1  correspond  to  evidence  of  compliance  being  vioUted,  •  denotes  nondeteciable 
results;  The  conditional  odds  are  for  each  set  of  measurements,  and  the  a  posteriori  odds  are  for  aU  seu. 


The  following  interesting  features  can  be  noted: 

*In  Case  A,  the  evidence  of  compliance  being  met  is  so  overwhelming  that  a  loss  of  one  set 
of  data  would  not  really  matter.  There  is  no  need  to  prejudge  and  nullify  the  whole  effort, 
the  remaining  evidence  may  still  be  sufficiently  convincing. 

*Case  B  shows  that  even  if  the  performance  of  the  'analyst'  was  less  than  desirable,  the 
results  for  the  purpose  of  proving  compliance  may  still  be  satisfactory  (the  a  posteriori 
probability  of  compliance  being  met  is  97%).  The  explanation  is  the  reported  values  were 
much  lower  than  the  compliance  levels. 

*Case  C  shows  that  (i)  even  nondetectable  results  do  not  provide  a  convincing  proof  of 
compliance  (a  posteriori  probability  of  70%  as  compared  to  the  a  priori  value  of  50%),  and 
(ii)  when  the  reported  values  are  close  to  the  compliance  levels,  it  is  necessary  to  have 
small  data  variability. 

'"Case  D  shows  that  evidence  against  compliance  (kr<l)  can  also  be  included.  By 
coincidence,  the  a  posteriori  and  the  a  priori  odds  are  identical. 
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Concluding  Remarks 

The  method  that  has  been  developed  quantifies  what  is  intuitively  expected.  Data  quality  should 
not  be  judged  by  itself,  but  by  the  intended  use.  What  is  acceptable  in  one  case  may  not  be  so  in 
another  case.  The  Bayesian  approach  can  deal  with  nondetectable  results  in  a  logical  manner 
and,  in  the  author's  opinion,  it  produces  results  that  can  be  readily  understood. 

Incinerator  test  bums  represent  only  one  example  of  compliance  assessment,  which  in  many 
cases  can  be  formulated  as  Eq.  (1),  or  more  generally,  as  Eq.  (9).  Different  means  of  obtaining 
variability  and  different  statistical  models  may  be  needed,  but  the  same  approach  can  be  used  in 
other  similar  cases. 

Reference 

Liem,  A.J.  and  M.A.  Wilson,  "A  quantitative  method  for  evaluating  incinerator  test 
bum  results",  /.  Air  Waste  Manage.  Assoc.,  Vol  41,  No.  1,  Jan.  1991:  47-55. 
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ENVIRONMENTAL  CHEMICAL  ANALYSIS 
L  Johnson  and  Y.  Kumar 
Research  and  Methods  Development  Branch,  Chemistry  Division 
Alberta  Environmental  Centre,  Alberta  Environment, 
Vegreville,  Alberta 

Abstract 

Pollutants  from  many  point  sources  are  complex  mixtures  of  chemical  compounds.  Correlation  of 
pollutants  in  environmental  samples  with  point  sources  can  be  difficult  as  the  relative  amounts  of 
chemical  compounds  in  the  mixtures  may  be  altered  through  evaporation,  chemical  and  biological 
degradation.  The  multivariate  analysis  based  procedure  presented  here  is  specifically  directed  to 
the  analysis  of  polychlorinated  biphenyls  (PCBs)  although  it  may  be  applied  to  many  other 
analytes. 

A  multivariate  analysis  based  procedure  for  identification  and  quantification  of  complex 
mixtures  of  (PCBs)  as  in  industrial  and  environmental  samples  is  presented.  There  are  209 
individual  PCB  compounds  (congeners).  Aroclors,  industrial  preparations  of  PCBs,  are  complex 
mixtures,  comprising  of  as  many  as  60  separate  congeners.  Analysis  of  PCB  samples  by  capillary 
gas  chromatography  results  in  complex  chromatograms  with  many  peaks.  Identification  and 
quantification  of  PCBs  as  Aroclors  becomes  difficult  when  more  than  one  Aroclor  are  present. 
Identification  and  quantification  can  be  further  compHcated  by  chemical  or  biological 
degradation  which  changes  the  relative  amounts  of  congeners  present. 

This  method  uses  a  cahbration  solution  of  36  individual  PCB  congeners.  Authentic  Aroclors  and 
samples  are  analyzed  using  this  caUbration  mixtures.  Multivariate  analysis  of  these  results  is 
used  to  identify  and  quantify  PCBs  in  the  samples  as  Aroclors.  The  use  of  operators  to  describe 
PCB  degradation  is  also  discussed. 
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RESPONSE  SURFACE  ANALYSIS 
Robert  B.  Heimann 
Materials  Section 
Alberta  Research  Council 
Edmonton,  Alberta 


Hie  experimental  environment 

According  to  Tukey  [1],  industrial  experiments  can  be  classified  by  their  depth  of  intellectual 
investment  as  (i)  confirmation  experiments,  (ii)  exploration  experiments,  and  (iii)  fiindamental 
or  "stroke-of-genius"  experiments.  A  second  way  of  classification  is  based  on  the  distance  of  their 
objectives  fi*om  the  real  world,  i.e.  firom  the  market  [2].  Finally,  the  continuity  of  factors  provides  a 
third  classification.  If  the  factors  are  continuous  variables^  and  controllable  at  preset  levels,  then 
the  response  surface  methodology  is  the  method  of  choice.  If ,  however,  many  factors  are  orderable 
but  not  measurable,  i.e.  at  discrete  levels  ,  the  response  surface  analysis  becomes  less  useful  and 
should  be  replaced  by  nested  or  spHt-plot  designs  [3]. 

Every  experiment  attempts  to  approximate  the  real  world  but  must  avoid  by  a  set  of  simplifying 
assumptions  the  complex  interactions  occurring  in  real  systems.  There  are,  in  principle,  two 
ways  to  accomplish  this:  the  "classical"  experimental  strategy  that  varies  one  parameter  at  a  time 
but  attempts  to  keep  all  others  constant,  and  the  statistical  experimental  strategy  that  varies 
parameters  simultaneously  to  obtain  a  maximum  of  information  with  a  minimum  of 
experiments.  The  classical  experimental  strategy  yields  accurate  results  but  requires  many 
experiments.  It  gives,  however,  misleading  conclusions  to  problems  that  have  synergistic 
parameter  interactions,  and  also  fails  to  elucidate  the  "structure"  of  a  system.  Table  1  compares 
these  strategies  [4]. 


this  paper,  the  terms  "variable",  "parameter"  and  "factor"  will  be  used 
s  imultaneous ly. 
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Table  1:  Two  Viewpoints  of  the  "Real  Woiid" 


CLASSICAL 


STATISTICAL 


Number  of  runs 


many 
complex 
absent 
smal  1 


few 

s  imple 
present 
large(+) 


Response 
Synergi  sm 
Error 


Strategy 
Thought  pattern 


one-f actor-at-a 
vert  leal 


-time 


factorial 


lateral  [5] 


(+)  Bias  errors  can  be  considered  by  blocking  and  randomization;  random 
errors  can  be  accounted  for  by  replication  of  experiments. 

The  evolution  of  the  experimental  environment  usually  starts  with  a  screening 
(Plackett-Burman)  design  [6]  with  many  independent  (up  to  40)  variables.  It  yields  a  crude 
prediction  of  the  ranking  of  importance  of  parameters  through  a  first-order  polynomial  model. 
The  experimenters  should  list  and  investigate  all  possible  parameters  they  can  think  of  but  should 
refrain  from  skipping  some  because  of  "folklore",  laboratory  gossip,  or  preferences  and  hunches. 
"Be  bold  but  don't  be  stupid!"  [4].  The  tremendous  reduction  in  the  number  of  required 
experiments,  however,  will  be  offset  by  the  failure  to  detect  synergistic  interactions  between 
parameters.  On  the  other  hand,  an  advantage  of  the  screening  designs  is  that  they  can 
accommodate  a  mix  of  continuous  and  discrete  parameters. 

With  the  independent  parameters  (up  to  eight)  identified  to  influence  the  response  of  the  dependent 
parameters),  "limited  response  surface"  experiments  should  be  run  such  as  fiill  two-level 
factorials,  or  even  a  fractional  three-level  (Box-Behnken)  design  [7]  that  jrields  higher  quality 
predictions  by  allowing  interpolation  within  the  experimental  space  by  a  second-order  polynomial 
model.  Such  a  model  determines  non-linear  behaviour,  i.e.  the  curvature  of  the  response  surface, 
and  thus  permits  the  estimation  of  synergistic  parameter  interactions. 

The  polynomial  models  approximate  the  "true"  response  surface  only  in  the  necessarily  narrow 
region  of  the  investigated  parameter  space.  Thus,  any  extrapolation  beyond  the  proven  validity  of 
the  predictions  is  dangerous  and  may  lead  to  useless  or  even  nonsensical  results.  To  avoid  this. 


88 


eventually  a  theoretical  model  has  to  be  built  [8]  that  yields  the  exact  mathematical  response 
surface,  usually  by  the  application  of  first-order  differential  equations. 

EXAMPLE  1:  FRACTIONAL  TWO-LEVEL  FACTOMAL  DESIGN  28-4 


In  this  example,  the  thickness  of  88WC12Co  alloy  coatings  should  be  optimized.  These 
wear-resistant  coatings  are  being  applied  to  carbon  steel  surfaces  by  plasma  spray  technology  [9]. 
The  parameters  selected  for  the  fractional  two-level  factorial  screening  design  are  shown  in 
Table  2,  the  randomized  design  is  shown  in  Table  3. 


Table  2:  Parameters  and  Parameter  Levels  for  Example  Design 

Variable  X.                                   %"  Type  of  Variable 

Plasma  Arc  Current  700  amps  900  amps  Continuous 

Argon  Gas  Pressure  ^-^^  ^-^^  Continuous 

Helium  Gas  Pressure  X^  0.34  MPa  1.36  MPa  Continuous 

Powder  Gas  Pressure  X^  0.34  MPa  0.68  MPa  Continuous 

Powder  Feed  Rate  X^  low  (0.5)  high  (2)  Discrete 

Powder  Grain  Size  Xg  (-Ab*b)n  (-75-t-45)M  Continuous 

Number  of  Passes  X^           20                       30  Continuous 

Spray  Distance  Xg          25  cm                  45  cm  Continuous 


Table  3:  Randomized  Fractional  Two-Level  Factorial  Design 

Run  #     X(l)      X(2)    X(3)    X(4)    X(5)    X(6)    X(7)    X(8)      Response  Y  a 

(m)  (m) 


1 

700 

0. 

,34 

0, 

.34 

0, 

.68 

2 

Coarse 

20 

45 

118 

79 

2 

900 

0. 

,34 

0. 

.34 

0, 

.34 

0.5 

Coarse 

30 

45 

16 

8 

3 

700 

1, 

,36 

0, 

.34 

0, 

.34 

2 

Fine 

30 

45 

203 

111 

4 

900 

1. 

,36 

0. 

.34 

0, 

.68 

0.5 

Fine 

20 

45 

57 

25 

5 

700 

0, 

,34 

1, 

.36 

0, 

.68 

0.5 

F  ine 

30 

45 

82 

35 

6 

900 

0, 

.34 

1, 

.36 

0, 

.34 

2 

Fine 

20 

45 

138 

87 

7 

700 

1, 

.36 

1, 

.36 

0, 

.34 

0.5 

Coarse 

20 

45 

30 

12 

8 

900 

1, 

.36 

1, 

.36 

0, 

.68 

2 

Coarse 

30 

45 

82 

44 

9 

900 

1, 

.36 

1, 

.36 

0, 

.34 

0.5 

Fine 

30 

25 

7 

4 

10 

700 

1, 

.36 

1. 

.36 

0, 

.68 

2 

Fine 

20 

25 

108 

104 

11 

900 

0, 

.34 

1, 

.36 

0, 

.68 

0.5 

Coarse 

20 

25 

65 

30 

12 

700 

0, 

.34 

1, 

.36 

0, 

.34 

2 

Coarse 

30 

25 

16 

10 

13 

900 

1. 

.36 

0, 

.34 

0, 

.34 

2 

Coarse 

20 

25 

26 

21 

14 

700 

1, 

.36 

0, 

.34 

0, 

.68 

0.5 

Coarse 

30 

25 

9 

12 

15 

900 

0, 

.34 

0, 

.34 

0, 

.68 

2 

Fine 

30 

25 

30 

13 

16 

700 

0, 

.34 

0, 

.34 

0, 

.34 

0.5 

Fine 

20 

25 

22 

19 

This  28-4  fractional  factorial  design  is  a  1/16  replicate  of  a  fiill  2^  factorial.  It  has  a  resolution  IV  [8] 
and  is  able  to  estimate  the  eight  main  effects  Xj  clear  of  composite  two-factor  interactions  Ei  (Table 
6).  The  effects  of  higher-order  interactions  can  usually  be  safely  neglected.  Composite  effects  of 
the  sum  of  four  two-factor  interactions,  however,  can  be  estimated  from  the  unassigned  factors.  If 
no  interactions  exist,  the  effects  of  unassigned  factors  can  be  used  to  estimate  the  experimental 
error,  i.e.  the  minimum  significant  factor  effect.  The  arrangement  of  the  parameter  levels  in  the 
design  matrix  follows  Yates's  standard  order. 

Tables  4  and  5  show  the  numerical  evaluation  of  the  results.  First,  the  sum  S(+)  of  all  responses  on 
the  "+"-level  is  calculated.  Then  the  sum  S(-)  of  all  responses  on  the  "-"-level  is  calculated.  The 
factor  effect  is  the  difference  D  of  the  two  sums,  divided  by  the  number  of "+"  signs  in  each 
column.  The  coefficient  C  of  the  parameter  in  the  response  equation  is  the  factor  effect  divided  by 
two.  The  factor  significance  is  checked  against  the  minimum  factor  significance,  FE(min)  = 
s(FE)  *  t(a,df),  where  s(FE)  =  (1/n  SE^)^,  t(a,df)  is  the  Student  t-value  for  a  confidence  level  a  of  a 
double-sided  significance  test  and  df  degrees  of  freedom.  All  absolute  factor  effects  larger  than 
FE(min)  are  considered  to  be  significant. 


Table  4:  Computing  of  Main  Factor  Effects 


Run# 

Main 

X(l) 

X(2) 

X(3) 

X(4) 

X(5) 

X(6) 

X(7) 

X(8) 

Y 

1 

+ 

+ 

+ 

118 

2 

♦ 

16 

3 

+ 

+ 

+ 

+ 

203 

4 

+ 

+ 

+ 

57 

5 

+ 

♦ 

82 

6 

•f 

138 

7 

+ 

+ 

♦ 

+ 

+ 

30 

8 

+ 

"f 

+ 

+ 

+ 

+■ 

+ 

82 

9 

+ 

■f 

7 

10 

+ 

+ 

+ 

+ 

108 

11 

+ 

+ 

+ 

+ 

65 

12 

+ 

+ 

•f 

16 

13 

■f 

+ 

+ 

26 

14 

+ 

«f 

9 

15 

+ 

+ 

•f 

30 

16 

22 

!(*) 

1009 

421 

522 

528 

551 

721 

362 

445 

726 

0 

588 

487 

481 

458 

288 

647 

564 

283 

H 

1009 

1009 

1009 

1009 

1009 

1009 

1009 

1009 

1009 

A 

1009 

-167 

35 

47 

93 

433 

-285 

-119 

443 

Effect 

63 

-21 

4 

6 

t— 1 

54 

-36 

-15 

55 

C 

32 

-11 

2 

3 

6 

27 

-18 

-8 

28 
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I^le  5:  Computing  of  Composite  Two-fector  Interactions 


Run#     E(l)      E(2)      E(3)      E(4)      E(5)      E(6)      E(7)  Y 


1 
1 

•f 

- 

118 

o 
L 

+ 

16 

+ 

+ 

+ 

203 

4 

+ 

+ 

- 

- 

+ 

57 

c 

+ 

— 

+ 

- 

- 

82 

c 
0 

•f 

+ 

- 

— 

+ 

138 

7 

_ 

_ 

_ 

■f 

_ 

30 

8 

+ 

82 

9 

_ 

_ 

_ 

+ 

_ 

7 

10 

+ 

+ 

108 

11 

+ 

+ 

65 

12 

+ 

16 

13 

+ 

+ 

26 

14 

+ 

+ 

9 

15 

+ 

+ 

30 

16 

+ 

+ 

+ 

22 

410 

644 

505 

419 

604 

413 

448 

599 

365 

504 

590 

405 

596 

561 

IZ 

1009 

1009 

1009 

1009 

1009 

1009 

1009 

A 

-189 

279 

1 

-171 

199 

-183 

-113 

Effect 

-24 

35 

0 

-21 

25 

-23 

-14 

r 

-12 

18 

0 

U 

-U 

-7 

91 


The  pattern  of  the  "+"  and  "-"  levels  again  corresponds  to  Yates's  standard  order.  Note,  that  the 
second  half  of  the  main  effect  design  matrix  is  the  mirror  image  of  the  first  half,  and  that  the  two 
halves  of  the  composite  two-factor  interaction  design  matrix  are  identical.  The  confounding 
pattern  of  the  composite  two-factor  interactions  Ej  is  shown  in  Table  6. 


Table  6:  ConfiHinding  Pattern  of  Composite  Two-£Eu;tor  Interactions 


E(l) 
E{2) 
E(3) 
E(4) 
E(5) 
E(6) 
E{7) 


E.  = 


X3X, 


12 

•f 

37 

48 

56 

13 

•f 

27 

58 

+ 

36 

14 

•f 

28 

36 

57 

15 

+ 

38 

+ 

26 

47 

16 

•f 

78 

•¥ 

34 

•f 

25 

17 

"f 

23 

*■ 

68 

45 

18 

•f 

24 

•f 

35 

+ 

67 

From  Table  5  the  minimum  factor  effect,  FE(min)  can  be  calculated  as  follows. 

a(FE)  =  (1/n  11^)'^^  =  (3592/7)'/^  =  22.6  [1] 
FE(min)  =  a(FE)  *  t(a=0.90,df=7)  =  22.6  *  1.895  =  43.  [2] 


Thus,  all  factor  effects  whose  absolute  values  are  larger  than  43  are  significant  at  a  confidence 
level  of  90%.  From  Table  4  it  follows  that  X5  (powder  feed  rate)  and  Xg  (spray  distance)  are  the  only 
significant  main  factor  effects.  This  holds  true  even  when  the  confidence  level  is  increased  to 
95%.  In  this  case,  the  minimum  factor  effect  is  FE(min)  =  22.6  *  2.36  =  53.  There  are  no  significant 
composite  two-factor  interactions  (Table  5).  Both  main  factor  effects  have  positive  signs,  i.e.  the 
thickness  of  the  coating  increases  with  increasing  powder  feed  rate  and  increasing  spray 
distance.  Short  spray  distances  lead  to  overheating  of  the  alloy  powder  thus  causing  thermal 
decomposition  and  reaction  of  the  tungsten  carbide  with  the  cobalt  matrix.  This  will  eventually 
result  in  brittle  phases,  loss  of  carbon,  and  higher  coating  porosities  [10]. 

The  response  polynomial  of  the  thickness  of  plasma  sprayed  88WCl2Co  alloy  coatings  can  be 
roughly  (zero-order  approximation)  expressed  by  the  equation 

d  [m]  =    32  +  27  Xg  ^  28  Xg.  [3] 
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EXAMPLE  2:  FULL  TWO-LEVEL  FACTORLVL  DESIGN  2* 


The  estimation  of  radioactive  source  terms  for  the  safety  analysis  of  a  nuclear  fuel  waste 
repository  involves  laboratory  leaching  experiments  to  determine  the  durability  of  used  UO2  fuel 
and  fuel  recycle  waste  glass  under  conditions  relevant  to  the  disposal  of  these  highly  radioactive 
materials  deep  in  a  granitic  pluton  of  the  Canadian  Shield  [11].  Table  7  shows  the  nine  parameters 
selected  for  leaching  in  two  different  groundwaters  of  used  UO2  fuel  and  a  borosilicate  glass 
containing  90-Sr,  137-Cs  and  several  actinides  such  as  239-Pu,  241-Am  and  244-Cm.  In  this 
example,  only  the  responses  of  the  amount  of  hydrogen  formed  by  radiolysis  of  the  groundwater 
and  of  the  normalized  mass  loss  of  strontium  will  be  examined. 


Table  7:  Parameters  and  Parameter  Levels  for  Example  Design 

Variable  X.  Type 

Waste  Form  Glass  UOj  fuel  Discrete 

CEC  of  Clay  X2  160  meq/kg  1350  meq/kg  Continuous 
Ionic  Strength  of 

Groundwater  X3  10~     mol/L              1.4  mol/L  Continuous 

Surface  Area/  _^  ^ 

Volume  Ratio  X^  12  m~                    120  m~  Continuous 

Metal  X^  Ti-grade  12  Constant 

Rock  Xg  Granite  Constant 

Pi-essure  X^  10  MPa  Constant 

Temperature  Xg  200  C  Constant 

Time  Xg  6  months  Constant 


The  geometrical  representation  of  such  a  2*  design  is  a  4D-hypercube  as  shown  in  Figure  1.  The 
four  selected  variables  are  the  four  axes  of  this  hypercube. 

With  the  technique  executed  in  detail  in  the  previous  example,  the  factor  effects  and  the  coefBcients 
of  the  response  equations  were  calculated.  The  amount  of  hydrogen  formed  during  g-radiolysis  of 
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I  (HE)  I 


FIGURE  1:  4D-Hypercube  as  a  geometrical  representation  of  a  2*  design. 


Co«ffldemx10«  RMidualt(y-9)x10< 


FIGURE  2:  Empirical  cumulative  distribution  of  the  coefficients  of  a  first- 
order  polynomial  for  the  amount  of  hydrogen  in  volt  developed  by 
7-radiolysis  of  groundwater  (left). 

FIGURE  3:  Empirical  cumulative  distribution  of  the  residuals  (amount  of 
hydrogen)  (right). 
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the  groundwater  can  be  described  by  the  polynomial 

[H,]  *  lo'  =  27  .  19  X,  .  8  X,  .  8  X,X,  -  4  X,    (1n  VoU).  [« 

The  amount  of  hydrogen  formed  is  a  strong  positive  function  of  the  activity  of  the  waste  form,  Xi, 
and  the  ratio  of  the  surface  area  of  the  solid  to  the  volume  of  the  solution,  X4.  It  is  negatively 
correlated  with  the  ionic  strengths  of  the  groundwater,  X3.  There  is  also  a  rather  strong  two-factor 
interaction  X1X4.  Figure  2  shows  the  empirical  cumulative  distribution  of  the  15  calculated 
coefficients  of  the  first-order  polynomial  on  a  probability  net.  If  all  the  coefficients  would 
randomly  fluctuate  in  a  Gaussian  fashion  around  a  statistical  mean  value  than  the  distribution 
would  follow  exactly  a  straight  line.  Deviations  from  this  line  signify  non-random  factor  effects, 
in  this  case  Xi  ,X3  and  X4.  With  the  response  equation  shown  above  the  residuals  were  calculated 
and  also  plotted  on  a  probability  net  (Figure  3).  The  figure  indicates  a  reasonable  qualitative  fit  of 
the  assumed  model  to  the  true  response  surface. 

The  normalized  mass  loss  of  90-Sr  is  shown  in  Figure  4  in  a  4D-hypercube.  The  four  axes  of  the 
hypercube  are  the  selected  parameters  Xi  to  X4  (see  Figure  1).  The  inner  cube  contains  the  data 
obtained  by  leaching  of  used  fuel,  the  outer  cube  contains  the  data  obtained  by  leaching  of  fuel 
recycle  waste  glass.  The  complex  response  equation  is 


'  =  42 

*  15  x;x,  ♦  uXXj    [Hg/rn'].  '  "     "    '  '^  £5] 


NML(Sr)*10    =  42  -  32  X,^-  17  X^^-JZ  X,  .  J6  X,Xj  .  21  X.X^ 


The  empirical  cumulative  distributions  of  the  residuals  for  NMIXSr),  and  NMIXCs),  are  shown  in 
Figure  5. 

It  was  suspected  that  all  two-factor  interactions  but  X2X3  are  merely  perturbations  of  the  parameters 
X2,  X3  and  X4.  The  interaction  of  the  cation  exchange  capacity  of  the  clays,  and  the  ionic  strength  of 
the  groundwaters,  however,  determines  the  pH  of  the  solution.  With  this,  a  table  of  factor 
assignment  can  by  produced  (Table  8). 
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FIGURE  5:  Empirical  cumulative  distribution  of  tht  residuals  of  the  normalized 
mass  losses  of  90-Sr  and  137-C$. 


FIGURE  6:  Anscombe-Tukey-plots  of  residuals  vs.  linear  (top)  and  parabo 
(bottom)  normalized  mass  loss  of  90-Sr. 
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Table  &  Order,  Signs  and  Assignment  of  Factors,  and  Hieir  Effect 
on  the  Normalized  Mass  Loss  of  Strontium-90 


Factor 


Sign 


Assignntent 


Effect 


Waste  Form 
Ionic  Strength 
Perturbation  of  Xj 
Perturbation  of  X^ 
Perturbation  of  Xj 
SA/V  ratio 
pH 


Fuel  shows  smaller 
NML(Sr)  than  glass 
Increasing  I 
decreases  NHL(Sr) 


Increasing  SA/V 
decreases  NHL(Sr) 
Increasing  pH 
increases  NML(Sr) 


The  three-factor  interaction  X1X2X3  =  X4X5  is  suspiciously  large  thus  suggesting  the  involvement 
in  the  leaching  process  of  corrosion  products  due  to  interaction  of  the  SA/V  ratio  and  the  type  of 
metal. 


The  plot  of  the  residuals  vs.  the  response  Y  (Figure  6,  top),  also  called  Anscombe-Tukey  plot,  shows 
three  straight  lines  instead  of  the  random  distribution  of  data  points  above  and  below  the  zero  line 
as  required  for  a  good  lit  of  the  data  to  the  model.  This  suggests  a  Y-data  transformation  to  achieve 
better  fit  to  the  model. 


Indeed,  several  data  transformations  increase  the  regression  coefficient  (coefficient  of 
determination)!  from  0.928  for  Y  (linear)  to  0.960  for  Y'^^  to  0.963  for  Y'^^,  considering  the  seven 
parameters  identified  above.  Figure  6  (bottom)  shows  random  scatter  of  the  data  points  for  the  Y^^ 
transformation.  This  could  mean  that  the  true  response  surface  has  a  parabolic  or  cubic  character. 
However,  the  selected  experimental  design  does  not  allow  for  non-linear  parameters  to  determine 
the  curvature  of  the  response  surface.  A  different  model  must  be  built,  and  additional  experiments 
must  be  run. 


1  R2=SSREG/SST0T;  SSREG  =  Sum  of  squares  of  regression,  i.e.  fraction  of  the  total  scatter  of  the 
original  observations  that  is  accounted  for  by  the  selected  effects.  SSTOT  =  Total  sum  of  squares  of 
deviation  of  the  observations  about  their  mean. 


97 


EXAMPLE  3:  THREE-LEVEL  FRACTIONAL  FACTORLVL  (BOX-BEHNKEN)  DESIGN 
This  example  deals  with  the  dissolution  of  a  simulated  borosilicate-based  nuclear  waste  glass  in 
the  presence  of  three  different  clays  and  three  different  groundwaters  [12].  Three  parameters  were 
varied  at  three  levels  (Table  9).  This  is  the  minimum  number  of  levels  for  each  parameter  to 
estimate  non-linear  responses.  Designs  with  more  than  three  levels  yield  higher  quality 
predictions  such  as  the  five-level  central  composite  [13]  or  the  Box- Wilson  [14]  designs  but  require 
many  more  experiments.  A  Box-Behnken  design  is  a  subset  of  a  full  three-level  factorial,  33-f  that 
uses  13  of  the  27  points  of  the  full  factorial  plus  2  extra  replicates  at  the  centre.  There  are  5  more 
points  than  the  minimum  10  points  required  to  estimate  the  10  parameters  of  the  second-order 
polynomial  (3  hnear,  3  quadratic,  3  two-factor  interactions,  1  three-factor  interaction).  Thus  the 
design  provides  5  degrees  of  freedom  for  error.  Geometrically  the  Box-Behnken  design  can  be 
described  by  an  edge-centred  cube  with  three  centre  points  (Figure  7). 

Table  9:  Parameters  and  Parameter  Levels  for  Example  Box-Behnken  Design  ^ 


Variable  "0" 


Cation  Exchange 

Capacity  of  Clay,X,    160  meq/kg     490  meq/kg     1360  meq/kg 

^  or  820  meq/kg($) 

Ionic  Strength  of       lO'    mol/L     0.013  mol/L    0.07  rool/L 
Groundwater,  X2 
Ratio  Clay  to 

Groundwater, X3  0.01  kg/L       0.05  kg/L       0.10  kg/L 

($)  The        level  was  split  between  Ca-montmor i 1 lonite  (CEC= 
1360  meq/kg),  and  Na-montmori 1 lonite  (CEC=820  meq/kg). 


Figure  7  shows  the  experimental  results.  Responses  measured  were  the  specific  mass  loss  of  the 
glass,  the  mass  of  dissolved  silicon,  the  mass  of  dissolved  boron,  all  in  kg/m^,  as  well  as  the  final 
pH  of  the  leach  solution  measured  at  room  temperature.  The  right-hand  plane  of  the  Box-Behnken 
design  cube  shows  both  the  results  of  sub-design  A  (using  Ca-montmorillonite)  and  B  (using  Na- 
montmorillonite).  Figure  8  shows  the  calculated  response  surfaces  of  the  specific  mass  losses  of 
the  glass  as  the  function  of  the  coded  levels  of  the  cation  exchange  capacity ,Xi  and  the  ratio 


98 


[Si]  X  10  3 


FIGURE  7:  Box-Behnken  design  cube  with  experimental  results  of  normalized  mass 
loss  of  boron  (1st  quadrant),  pH  (2nd  quadrant),  normalized  mass  loss 
of  silicon  (3rd  quadrant)  and  specific  mass  loss  of  the  glass 
(4th  quadrant)  for  sub-designs  A  and  B  (see  text). 


X,  ■  -1.0  (Granitic  groundwater)  X,  ■♦1.0  (Star>dard  Canadian  Shield 

Saline  Solution) 


FIGURE  8:  Response  surfaces  of  specific  mass  losses  of  the  glass  as  a  function 
of  the  cation  exchange  capacity  of  the  clay.  X.  and  the  ratio  clay/ 
groundwater,       for  constant  ionic  strengths  of  the  groundwater,  X 
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clay/ground  water, Xa  at  constant  ionic  strengths  of  the  groundwater  (left  column:  granitic 
groundwater,  1=10-^  mol/L;  right  column:  saline  solution,  1=0.07  mol/L;  top  row:  Sub-design  A 
with  Ca-montmorillonite  at  "+"  level;  bottom  row:  Sub-design  B  with  Na-montmorillonite  at  "+" 
level).  The  specific  mass  loss  of  the  glass  in  kg/m2  is  given  by 

Am  *  10^  =  12.7  *  3.2         -  1.3  X3  *  2.5  X^Xj  -  2.1  X^Xj  -  1.4  X^Xj.  [6] 

From  Figure  8  it  appears  that  the  use  of  Na-montmorillonite  as  buffer  material  to  isolate  the 
nuclear  fuel  waste  containers  from  groundwater  is  preferable  over  the  use  of  Ca-montmorillonite. 
The  latter  interacts  strongly  with  groundwater,  and  produces  exceptionally  low  pH  values  that 
promote  the  attack  of  the  waste  glass  as  shown  by  the  high  specific  mass  losses  of  the  glass  in 
contact  with  Ca-montmorillonite  (Figures  8a  and  8c).  The  interaction  of  clay  and  groundwater 
can  also  be  seen  in  Figure  9.  Whereas  Ca-montmorillonite  produces  pH-values  of  3  in  granitic 
groundwater  (X2  =  -1)  and  even  lower  values  in  saline  solution  (X2  =  +1)  (Figure  9a),  the 
Na-montmorillonite  produces  pH-values  of  9  in  granitic  groundwater  and  around  7  in  saline 
solution  (Figure  9b). 

Figure  10  shows  the  probability  plot  of  the  coefficients  of  the  response  equation  of  the  normalized 
mass  loss  of  silicon  from  the  glass.  It  can  be  seen  that  the  three  linear  parameters  Xi,  X2  and  X3  do 
not  fit  the  Gaussian  straight  line  thus  indicating  significant  parameter  effects.  Indeed,  the  linear 
response  equation  yields 

NML(Si)  *  10^  =  5.5  *  4.0  X^  -  1.3  X2  -  1.4  X3    [kg/m^].  [7] 

On  the  other  hand,  the  normalized  mass  loss  of  boron  follows  a  parabolic  rate  law: 

NML(B)  *  10'  =  13  *  61  X,^  *  39  X,  -  13  X^X^    [kg/m].  [8] 

Figure  11  shows  the  contour  lines  that  indicate  a  trough  with  negative  slope  with  increasing  ionic 
strength  of  the  groundwater  due  to  the  negative  coefficient  of  the  two-factor  interaction  parameter 
X1X2.  The  parabolic  contours  are  due  to  the  strong  contribution  of  the  quadratic  Xi  parameter. 
Figure  12  shows  the  probability  plot  of  the  coefficients  of  the  second-order  response  equation  of  the 
normalized  mass  loss  of  boron.  The  linear  and  quadratic  coefficients  of  parameter  Xi  due  not  fit 
the  Gaussian  line  thus  indicating  effects  significantly  different  from  experimental  noise.  A 
simple  variance  test  confirms  that  the  fitted  response  surface  was  estimated  with  sufficient 
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FIGURE  9:  Development  of  pH  for  sub-designs  A  (top)  and  B  (bottom)  as  a 

function  of  the  cation  exchange  capacity  of  the  clay,  X,  and  the 
ionic  strengths  of  the  groundwater.  X^. 
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Cotfficients  x  10^ 

FIGURE  10:  Empirical  cumulative  distribution  of  the  coefficients  of  the 

second-order  polynomial  for  the  normalized  mess  loss  of  silicon. 


101 


-1.0       -0.5         0         0.5  1.0 
Codtd  cation  •xchongt  capacity 

FIGURE  11:  Contour  lines  (isopleths)  of  the  normalized  mass  loss  of  boron 
in  the  X^-Xj  plane. 
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FIGURE  12:  Empirical  cumulative  distribution  of  the  coefficients  of  the 
second-order  polynomial  for  the  normalized  mass  loss  of  boron 
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precision  using  the  equation  V(Y)  =  p*a2/n,  where  p  =  number  of  parameters  fitted,  a2  =  estimate  of 
error  variance  of  the  replicated  centre  point,  and  n  =  number  of  experiments.  Accordingly,  V(Y)  = 
(3X5.33*10-5)/15  =  1.066*10-s,  and  a  =  1.032*10-6.  Figure  11  shows  that  the  fitted  Y  data  range 
approximately  from  20*10-6  to  120*10-6.  Thus  we  have  failed  to  show  any  substantial  lack  of  fit. 
The  predicted  change  of  Y  is  indeed  97  times  the  standard  error  of  Y. 

Ck>nclusion 

The  response  sunace  methodology  is  a  convenient  and  easy  tool  to  estimate  how  a  particular 
response  is  affected  by  a  given  set  of  independent  variables  over  some  specific  region  of  interest. 
Furthermore,  it  allows  to  determine  those  values  of  input  variables  at  which  a  particular  response 
is  maximized  or  minimized  ,  and  gives  also  information  on  the  character  of  the  response  surface 
close  to  the  extremum. 

One  should  always  try,  if  time  and  resources  permit,  to  develop  the  full  experimental  strategy  by 
proceeding  from  screening  (Plackett-Burman  design  or  fractional  two-level  factorial)  to  "limited 
response  surface"  analysis  (full  two-level  factorial)  to  "response  surface"  analysis  per  se 
(Box-Behnken  design  or  other  three-level  factorials)  to  "exact  model"  building. 

Many  aspects  of  the  response  surface  methodology  could  not  be  dealt  with  in  this  context  such  as 
orthogonal  blocking,  detailed  analysis  of  variance,  and  canonical  analysis.  These  subjects  are 
described  in  more  detail  in  numerous  textbooks,  for  example  in  Box,  Hunter  and  Hunter  [7]. 

It  should  be  emphasized  again  that  the  methodology  described  in  this  article  is  just  a  convenient 
tool,  not  gospel.  Common  sense,  critical  judgement,  even  Ockham's  rasor  and  the  "KISS"  strategy 
have  to  be  applied  at  all  stages  of  the  analysis  to  yield  meaningful  results. 
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SAS  STATISTICAL  APPLICATIONS 
Serge  Dupids 
Software  Suppcrt 
Alberta  Public  Works,  Supplies  and  Services 


Introduction 


SAS     (Statistical  Analysis  System)  is  a  comprehensive  package  designed  for  data  analysis.  It 
includes  a  programming  language  for  reading  and  manipulating  data  of  almost  any  form  and  a 
number  of  procedures  designed  for  analysing  and  displaying  the  data. 


SAS  offers  an  integrated  approach  to  building  systems  by  replacing  several  software  packages  for 
editing,  database  functions,  programming,  reporting  and  graphics  into  a  single  system.  Although 
this  paper  deals  primarily  with  statistics,  SAS  should  be  considered  as  an  complete  system.  Add- 
ons to  the  base  product  include: 


SAS/FSP        Interactive  facility  for  data  entry  .retrieval. 

SAS/GRAPH  Business  graphics(Pie^ar  charts).  Scientific  graphics  (contouring,  mapping,  3D 
modelling). 

SAS/AF         Application  facility  to  create  menus  and  turnkey  applications. 
S AS/OR         Operations  research,  decision  support.project  management.  Includes  Gantt  charts, 
Critical  path  Analysis 

SAS/STAT     Wide  range  of  statistical  procedure  for  analysis  and  modelling  such  as  ANOVA, 

Regression,  Chi-square,  survival  analysis  and  others. 
SAS/ETS        Econometric8,financial  planning,  time  series  modelling  including  ARIMA, 

forecasting  time  series,  cross  spectral  analysis. 
SAS/IML        Interactive  matrix  programming  language  similar  to  APL  which  can  be  used  for 

regression. 

SAS/QC  Quality  control  software  including  SHEWHART  and  several  experimental 

design  macros. 


105 


SAS  is  available  on  mainframes,  minis  and  PCs  on  several  operating  systems  including  VAX, 
UNIX,  MVS,  and  PC-DOS  and  Macintosh  so  that  analyses  and  data  can  be  readily  shared  among 
scientists,  worldwide.  The  version  for  PC's  is  leased  to  Alberta  Grovemment  Departments  and 
Agencies  through  a  master  license  at  Public  Works,  Supplies  and  Services  (PWSS).  The  Base 
product  is  leased  from  $90/year  with  add-ons  from  $40/year.  The  cost  includes  support  and 
upgrades,  but  not  manuals. 

Manuals 

All  SAS/PC  users  should  obtain  the  LANGUAGE  guide  (#  P5856.  $19.95US)  and  PROCEDURES 
guide  (#  P5856,  $16.95US).  For  specific  applications,  the  following  listed  in  the  reference  section 
may  also  be  reqiiired. 

A  master  index  of  all  documentation  is  highly  recommended  (publication#  56000).  Many 
manuals  can  be  ordered  from  PWSS  (charged  to  mainframe  account),  others  from  SAS  directly 
(DPO  direct  to  SAS).  A  free  semi-annual  catalog  of  all  publications  is  available  from  SAS.  The 
address  to  order  directly  is: 
SAS  (Canada) 

225  Duncan  Mills  Road,  Suite  300 
North  York,  Ontario 
M3B3K9 

Phone  (416)  443-9811  Fax:  (416)  443-1269 
Hard  ware  Required 

SAS/PC  will  require  a  PC/AT  class  computer  or  better  such  as  IBM  PS-2/Model  60-80  and 
COMPAQ  286/386.  A  386  machine  is  recommended  for  future  upgrades  of  SAS/PC  under 
Windows.  Expanded  memory  is  not  normally  required  but  will  be  very  useful.  A  math 
co-processor  will  he  useful  for  complex  statistical  or  graphical  applications.  With  the  addition  of 
an  IRMA  or  similar  card  for  communication,  SAS/PC  programs  can  be  executed  on  a  mainframe 
where  large  data  files  can  be  processed.  Be  prepared  to  allocate  10  to  30  megabytes  for  storage  on  the 
PC.  This  could  be  stored  on  a  Local  Area  Network  (LAN)  server  as  SAS/PC  operates  under  many 
LANs  such  as  NOVELL,  IBM,  Banyan-Vines  and  others. 
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Training 

SAS  has  a  steep  learning  cxirve,  but  fortunately,  users  do  not  need  to  understand  the  whole  system  to 
be  able  to  effectively  analyse  their  data.  It  is  highly  recommended  that  users  take  the  tutorial 
provided  with  the  installation  diskettes  to  get  familiar  with  windows,  functions  keys  and  basic 
syntax  of  the  SAS  language. 

This  basic  training  can  be  enhanced  with  other  computer  based  tutorials  available  from  Software 
Support,  classroom  courses  and  seminars  which  are  held  several  times  a  year.  The  software  also 
includes  several  help  screens  and  fill  the  blanks  menus  which  can  be  activated  by  function  keys 
(Fl=help)  or  by  typing  MENU  on  the  command  line.  For  those  wanting  an  automated  system, 
SAS/Assist  and  SAS/AF  can  be  used  to  build  turnkey  systems  requiring  little  knowledge  of  SAS  by 
its  users. 

Each  product  also  has  its  own  set  of  sample  programs,  complete  with  data  and  ready  to  run.  The 
index  is  listed  in  files  called  *.BLS.  For  example,  the  INDEXSTT.BLS  file  contains  names  of 
sample  programs  in  the  SAS/STAT  guide.  To  run  any  of  these  programs,  you  need  to  invoke  SAS 
and  INCLUDE  the  program  name,  for  example,  ANOVAEX  is  included  with  the  command: 

command===>  INCLUDE  'ANOVAEX.SAS' 

The  program  is  ready  to  use  by  pressing  the  execute  (flO)  key. 
A  partial  list  of  samples  for  SAS/STAT: 


ANOVAEX 

Documentation  Examples  from  PROC  ANOVA 

TTESTEX 

Documentation  Examples  from  PROC  TTEST 

CATKAPPA 

KAPPA  STATISTIC  COMPUTATION 

FREQTREN 

TREND  TEST  USING  PROC  FREQ 

CLUSTER 

CLUSTER  ANALYSIS  OF  MAMMALS'  TEETH  DATA 

FASTCLUS 

USING  MACROS  TO  ANALYZE  ARTIFICLU.  FIVE-GROUP 

DEXOC 

Macros  for  Central  Composite  Designs 

DEXPB 

Macros  for  Plackett-Burman  Designs 

PLANEX 

Documentation  Examples  from  PROC  PLAN 

CANDISC 

CANONICAL  DISCRIMINANT  ANALYSES  OF  CARS  DATA 

CANDPOLY 

POLYNOMLAL  CANONICAL  DISCRIMINANT  ANALYSIS 

DISCKERN 

KERNEL  DISCRIMINANT  ANALYSES  OF  IRIS  DATA 

INTDISC 

EXAMPLE  FROM  INTRO  TO  SAS  DISCRIMINANT  PROCS 
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STEPDISC 


STEPWISE  DISCRIMINANT  ANALYSES  OF  CARS  DATA 


Gettizi«r  Data  into  SAS 


There  are  several  method  of  entering  data  into  SAS: 

-  Internal  to  the  programs,  using  the  INPUT  and  CARDS  statements.  This  is  acceptable 
when  analysing  small  amounts  of  data. 

-  External  to  SAS,  in  DOS  'flat  files'.  These  can  be  of  any  format  and  size,  read  with 
FILENAME  and  INPUT  statement. 

-  In  Dbase  II,  III  ,111+  or  Dbase  IV  files  which  can  be  converted  very  simply  by  using  the 
DBF  procedure. 

-  In  LOTUS  1-2-3  files  pre-converted  to  DIF  or  to  printed  to  a  PRN  file  and  then  read  with 
an  INPUT  statement. 

-  Using  SAS/FSP  for  entering  the  data.  This  will  also  do  quality  control  on  data  entry  and 
allow  a  screen  to  be  any  input  form.  This  method  is  preferred  for  large  numbers  of 
variables  or  where  data  entry  is  critical.  A  specific  tutorial  is  available  for  learning 
SAS/FSP. 


ASanqile  Session  with  SAS/PC 


This  is  a  sample  program  which  reads  data  from  a  CARDS  list  and  does  a  simple  means 
calculation. 


=  OUTPUT  — 
Cominand==> 


=LOG  — 
Comnanda 


||»PROGRAM=«' 
Command=»> 


00001 

data ; 

00002 

input  X  y; 

00003 

cards ; 

00004 

1  3 

00005 

3  5 

00006 

4  8 

00007 

00008 

PROC  MEANS; run; 

(Press  FIO  to  execute  the  program) 
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The  output  is  displayed  on  the  output  screen.  To  access  it,  press  f5  and  view  the  output: 


=  OUTPUT  =========================^=======^^ 

Coiiunand==> 

N    Obs    Variable    Minimum    Maximum  Mean  Std  Dev 
3  X  1.00000      4.00000   2.666  1.527 

3  X  3.00000      8.00000   5.333  2.516 

=LOG 

Cominand==> 


=PROGRAM  = 
Cominand==> 

00001 
00002 
00003 


Function  Keys 

Pressing  F2  will  display  current  keys  assignment.  The  most  useful  function  keys  are: 
Fl-  help 
F2-  keys 

F5-  Jump  to  next  window 
F7-  Zoom  window  to  full  screen 
F9-  Recall  program 
FIO  -  END  /  Execute  program 
Many  other  function  keys  are  assigned  and  all  are  user  definable. 

ComnuuDid  Line 

There  is  a  command  line  on  every  window.  The  most  common  commands  are: 
Command=s>  X  Shell  to  DOS  (exit  to  return) 

Commands=>  Include  'Xdosfile'  Read  a  DOS  file  to  window 

Command=s>  File  Adosfile'  Write  a  DOS  file  to  disk 

eg  to  printer:  FILE  '\lptl' 
Command==>  BYE  Finish  SAS  Session 
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Cominand==>  MENU 


Call  procedure  menu 


Statistical  Products 


The  base  package  includes  very  limited  statistical  procedures  for  descriptive  statistics 
(mean,median,  modes  etc.).  More  complex  statistical  procedures  can  be  found  in  SAS/STAT, 
SAS/ETS,  SAS/IML  and  SAS/QC. 


SAS/STAT  Product 


REGRESSION  PROCEDURES: 

CATMOD        Analyses  data  in  continency  tables 

Least  square  fit  for  simple,  multiple,  polynomial  and  weighted  regression 
Fits  parametric  models  to  failure  data  (survival  analysis) 
Non  linear  regression,  such  as  Gauss-Newton 
For  ill  condition  data  using  the  Gentleman-Givens  method  Use 
colinearity  diagnostics  of  REG  to  determine  if  ORTHOREG  is 
needed. 

Linear  regression  with  method  selection  from  nine  options  such  as 
backward,  forward,  stepwise,  r-square 
Builds  response  surface  models 

Obtains  linear  and  nonlinear  transformations  of  variables  using 
alternating  least  squares  to  fit  data  to  linear  regression  canonical 
regression  and  anova  models 


GLM 

UFEREG 
NUN 

ORTHOREG 


REG 

RSREG 
TRANSREG 


ANALYSIS  QF  VARIANCE; 

ANOVA         Includes  multivariate  anova  and  repeated  measures  anova  with  several 

comparison  tests.  DO  NOT  use  for  unbalanced  data,  use  GLM  instead 
CATMOD      Fits  linear  models  for  categorical  data 

GLM  Regression,  analysis  of  covariance,  repeated  measures  analysis, 

multivariate  anova,  hypothesis  tests,  test  of  means 
NESTED        Anova  and  analysis  of  covariance  on  nested  random  models 
N  PARI  WAY  Non-parametric  one-way  of  rank  scores 

PLAN  Constructs  designs  and  randomizes  plans  for  nested  and  crossed 
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TTEST 
VARCOMP 


experiments 

Compare  means  of  two  groups 

Estimate  of  variance  components  for  random  or  mixed  models 


rATEOORICAL  ANALYSIS: 

CATMOD       Fits  linear  models  to  functions  of  categorical  data 

FREQ  Builds  tables  or  continency  tables  with  chi-squares,  Fishers'  test 

MTTT.TTVARIATE  ANALYSIS: 

PRINCOMP    Principal  component  analysis,  output  component  scores 
FACTOR        Principal  component  and  common  factor  analysis  with  rotations 
CANCORR     Canonical  correlation  analysis 

DISCRIMINANT  ANALYSIS: 

DISCRIM       Compute  discriminant  functions,  including  non-parametric 
methods 

CANDISC       Canonical  analysis  to  find  linear  combinations  of  quantitative 
variables 

STEPDISC      Forward,  backward  or  stepwise  selection 


CLUSTERING  PROCEDURES; 

CLUSTER      Hierarchical  clustering  using  11  methods  applied  to  coordinate  or 
distance  data 

FASTCLUS  Finds  disjoint  clusters  using  k-means  (up  to  100,000  observations) 
VARCLUS      Hierarchical  and  disjoint  clustering  by  oblique  multiple  group 

component  analysis 
TREE  Draws  tree  diagrams  (dendrograms  or  phenograms) 


The  following  may  be  used  prior  to  clustering: 


ACECLUS      Estimate  of  pooled  within  cluster  covariance  matrix 
PRINCOMP    Principal  component  analysis 

STANDARD  Standardizes  variables  to  specified  mean  and  variance 
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CORING  PROrFPTrRES: 

STANDARD  Standardizes  variables  to  specified  mean  and  variance 
RANK  Ranks  numeric  variables  from  high  to  low 


SCORE  Constructs  new  variables  that  are  linear  combination  of  old 

variables  according  to  a  scoring  dataset  (used  with  PROC  FACTOR) 


STTRVTVAL  ANALYSIS: 

This  is  used  for  data  that  measure  length  of  time  to  occurrence  of  an  event,  for  example 
mean  time  before  failure,  or  length  of  time  a  person  stays  on  the  job. 
LIFEREG       Fits  parametric  accelerated  failure  time  or  regression  models 
LIFETEST     Computes  nonparametric  estimate  of  survival  distribution 


OTHER  SAS/STAT  PROCEDURES 

CORRESP      Simple  and  multiple  correspondence  analyses.  Reads  a  continency  or  Burt 

table  or  creates  these  tables  from  raw  data.  Also  named  Appropriate 

scaling,  reciprocal  averaging. 
PRINQUAL    Obtains  linear  and  nonlinear  transformations  of  variables  using 

alternate  least  squares. 
PROBIT        Maximum  likelihood  estimates  of  regression  parameters  and 

threshold  response  rate  for  biological  assay  quantal  response  data 
CALIS  Covariance  analysis  of  linear  structural  equations 

LOGISTIC      Fits  linear  logistic  regression  models  for  binary  or  ordinal  data 


SAS/ETTS  (Eoonometric  and  time  series)  PRODUCT 


SAS/ETS  has  a  number  of  statistical  procedures  for  analying  time  series.  This  includes: 

Econometric  models  for  market  analysis  or  macro  economics 

Corporate  financial  modelling  including  planning  equations 

Physical  models  for  mechanics,  hydraulic  and  hydrologic  models 

Biological  models  to  simulate  living  systems 
-   Ecological  models  to  represent  systems  in  nature. 

ARIMA         Autoregressive  integrated  moving  average  process(Box-Jenkins) 
AUTOREG     Regression  allowing  serially  correlated  error 
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FORECAST    Forecast  series  using  trend- adjusted  autoregressive  or  exponentially  smoothing 
models 

PDLREG        Multiple  regression  with  polynomial  distributed  lag 

SPECTRA      Computes  periodograms,  smoothed  spectral  density  estimates,  white  noise  tests 
STATS  PACE  Autocorrelation  of  stationary  vector  time  series  by  state  space  models 
Xll  Seasonally  adjusts  quarterly  or  monthly  time  series 


A  Sample  Time  Series  Application 


The  following  is  a  sample  program  to  extrapolate  a  time  series  using  the  FORECAST  procedure. 


data  a ; 

input  month  year  date  :monyy. 

crude  coal; 
format  date  monyy. ; 
label  crude=' CRUDE  PETROLEUM'; 
cards ; 

1  1965  JAN65  24.09  40.015 

2  1965  FEB65  21.86  37.862 

3  1965  MAR65  24.38  42.816 

4  1965  APR65  23.68  41.862 
...more  data  lines... 

11  1972  NOV72   28.28  56.297 

12  1972  DEC72  28.94  44.904 

proc  forecast  data=a  out=b  outest=c 

trend=2  outactual  outlstep  outlimit  interval=month  lead=15; 

id  date; 

var  crude  coal; 
proc  print  data=c; 

,  title  'The  Estimates  from  PROC  FORECAST'; 


The  output  dataset: 


The  Estimates 

from 

PROC  FORECAST 

OBS 

_TYPE_ 

DATE 

CRUDE 

COAL 

1 

N 

DEC72 

96 

96 

2 

SIGMA 

DEC72 

0.7967232 

6. 3485238 

3 

CONSTANT 

DEC72 

24 .349114 

42.934934 

4 

LINEAR 

DEC72 

0.0612253 

0.0728315 

5 

AROl 

DEC7  2 

0. 6101027 

6 

AR02 

DEC7  2 

7 

AR03 

DEC7  2 

8 

AR04 

DEC72 

9 

AR05 

DEC72 

10 

AR06 

DEC72 

11 

AR07 

DEC72 

0.2069177 

12 

AR08 

DEC72 

13 

AR09 

DEC72 

14 

ARIO 

DEC72 

15 

ARll 

DEC72 

-0. 197647 

16 

AR12 

DEC7  2 

0.5571647 

17 

AR13 

DEC72 

-0.498052 
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The  CRUDE  variable  has  several  autoregressive  terms,  there  is  an  indication  of  seasonality  as 
shown  by  the  significance  of  the  terms  around  AR12. 


The  output  dataset  can  also  be  plotted  with  actual  data,  forecast  and  confidence  limits: 


proc  gplot  data=b; 

symboll  i=join  c=red  L=l  r=l 
symbol2  i=join  c=green  L=2  r=l 
syTnbol3  i=join  c=cyan  L=3  r=l 
symbol4  i=join  c=yellow  I>=4  r=l 
symbols  i=join  c=Blue  Lr=5  r=l 
plot  crude*date=_type_; 
title  'Plot  of  Crude  Oil  Use'; 

run; 


Plot  of  Crude  Oil  Use 

35- 


21 


SEP62  JUN65  MAR68  0EC70  SEP73  JUN76 

DATE 

Type  of  Observation   ACTUAL   FORECAST 

 —  i   —  U3f)  ^  
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SAS/QC  (Quality  Control) 


SAS/QC  is  an  add-on  package  to  SAS/PC  designed  for  statistical  quality  control  and  experimental 
design.  There  are  several  applications  for  the  product  in  manufacturing  and  laboratory.  It 
features  several  graphics  and  analytical  procedures  such  as: 

CAPABILITY  Process  capability  analysis  (including  histograms  with  Gamma, 

Weibull  and  lognormal  distribution) 
CUSUM  Cumulative  sum  control  charts 

FACTEX  Orthogonal  fractional  factorial  analysis 

MACONTROL  Moving  average  control  charts 

OPTEX  Finding  of  optimal  design 

SHEWHART  Shewhart  charts  (MEAN  X  AND  Range  charts) 

ADX  Macros  for  design  and  analysis  of  experiments  (menu  driven) 

ISHIKAWA  Cause  and  effect  diagrams 

A  sample  use  for  PROC  SHEWHART 


The  following  was  taken  from  the  sample  library.  It  tests  and  plots  means  and  standard 
deviation. 


data  lengstat; 

input  day  mean  std  n; 
informat  day  date?.  ; 
format  day  dates.  ; 

label  day  -'Date  of  Sample  Collection' 
mean"' Average  Length' 
std  »• Standard  Deviation  of  Length' 
n      "'Subgroup  Sample  Size'; 

cards ; 


02JAN86  115.39 

5. 

67 

20 

03JAN86  113.68 

2. 

96 

20 

04JAN86  114.69 

5. 

45 

20 

.  -  -  more  da'ta  lines 

19JAN86  115.51 

5. 

25 

20 

20JAN86  113.63 

3. 

17 

20 

data  lengstat; 
set  lengstat; 

rename  mean«lengthx    /*  subgroup  mean  */ 
std  -lengths    /*  subgroup  std.  */ 
n      =lengthn;  /*  subgroup  sample  size  */ 
proc  shewhart  history»lengstat  graphics; 
xschart  length*day» • * • ; 
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SAS/flC  Uiflt 


The  charts  show  that  this  process  is  not  in  statistical  control  since  the  standard  deviation  (bottom) 
of  the  measurement  exceeds  the  upper  control  limit. 

SAS/IML  (matrix  pmyramwiiig) 

The  interactive  matrix  language  allows  more  direct  programming  and  array  processing.  It  can 
be  used  for  statistical  applications  such  as: 

-  Correlation 

-  Solving  non-linear  equations 
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-  Regression 

-  Alpha  factor  analysis 

-  Categorical  linear  models 

-  Response  surface  analysis 

-  Logistic  and  Probit  regression 

-  Linear  programming 

-  And  many  other  applications 

SAS/IML  can  be  used  to  replace  the  APL  language  without  use  of  a  special  keyboard.  It  has  the 
advantage  of  being  interactive,  unlike  the  DATA  step.  Data  read  in  matrices  can  be  converted  to 
SAS  datasets  and  SAS  datasets  can  be  converted  to  matrices. 

Correlation  example  with  IML 

The  following  program,  taken  from  the  sample  library  supplied  with  SAS/IML  shows  the  use  of 
matrix  language  for  a  simple  correlation. 


PROC  IML; 

/*  CORRELATION  */ 

START  CORR; 

N=NROW(X) ; 

/*   DIMENSION  OF  X  */ 

SUM=X[+,];                /*  COLUMN 

SUMS   BY  REDUCING  ROWS  */ 

XPX=X^*X-SUM^*SUM/N; 

/*  CROSS PRODUCTS  */ 

S=DIAG(1/SQRT(VECDIAG(XPX) ) ) ;  /*  SCALING  MATRIX*/ 

CORR=S*XPX*S; 

/*  CORRELATION  MATRIX*/ 

PRINT  "Correlation  Matrix" 

, ,CORR[ROWNAME=NM  COLNAME=NM] ; 

FINISH; 

/  *  STANDARDI Z  ATION  * 

/ 

START  STD; 

MEAN=X[+, ]/N; 

/*  MEANS  FOR  COLUMNS 

*/ 

X=X-REPEAT(MEAN,N, 1) ; 

/*  CENTER  X  TO  MEAN  ZERO 

*/ 

SS=.X[##,  ]  ; 

/*  SUM  OF  SQUARES  FOR  COLUMNS 

*/ 

STD»SQRT(SS/(N-1) ) ; 

/*  STANDARD  DEVIATION  ESTIMATE*/ 

X=X*DIAG(1/STD) ; 

/*  SCALING  TO  STD  DEV  1 

*/ 

PRINT  /'Standardized  Data" 

, ,   X[COLNAME=NM] ; 

FINISH; 

/*  SAMPLE  RUN  */ 

X  »  {   1     2  2, 

2     2  1, 

4     2  1, 

0     4  1, 

24     1  0, 

1     3     8 )  ; 

NM-{AGE  WEIGHT  HEIGHT); 

RUN  CORR; 

RUN  STD; 
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Two  matrices  are  produced:  a  Correlation  matrix  and  a  Standardized  data  matrix: 


Correlation  Matrix 


CORK  AGE         WEIGHT  HEIGHT 

AGE  1   -0.717102  -0.436558 

WEIGHT  -0.717102                    1  0.3508232 

HEIGHT  -0.436558   0.3508232  1 


Standardized  Data 


X                AGE  WEIGHT  HEIGHT 

-0.490116  -0.322749  0.2264554 

-0.272287  -0.322749  -0.452911 

-0.163372  -0.322749  -0.452911 

-0.59903  1.6137431  -0.452911 

2.0149206  -1.290994  -0.792594 

-0.490116  0.6454972  1.924871 


SAS/IML  also  contains  a  number  of  routines  for  displaying  data  which  give  a  greater  amount  of 
control  over  graphs  than  with  SAS/Graph  alone. 


Summary 


SAS  has  a  complete  set  of  tools  for  the  statistical  analysis  of  any  source  of  data.  The  challenge  is 
knowing  which  procedure  to  use  under  specific  conditions.  Since  it  will  almost  always  produce  an 
output,  a  good  understanding  of  statistics  is  required  for  interpretation. 


The  system  has  a  steep  learning  curve  for  those  wanting  to  use  SAS  in  its  raw  form  but  it  can  be 
'packaged'  as  an  automated  system  with  the  use  of  menus  and  sample  programs. 
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Appendix  A 


Program 
Thursday,  October  21,  1990 


TIME 

LOCATION 

0900-0915 

AUD 

0915-1000 

AUD 

1000-1030 
1030-1115 
AUD 

1115-1200 
AUD 

1200-1300 

TIME 

LOCATION 

1300-1400 
LMR 

CR 

BDR 

BDR 

1400-1500 
CR 

LMR 

BDR 

BDR 

1500-1515 

1515-1615 
BDR 


TOPIC 

Opening  Remarks 

Applying  Statistics  to  Practical  Problems 
Milton  Weiss 

BREAK 
Time  Series  Analysis 
Victor  Adamowicz 

Multivariate  Methods 
Dave  Jobson 

LUNCH-ARC  (Catered) 

GROUP  TOPIC 

A  Statistical  Graphics 

B  Discriminant  Analysis 

C  Sampling  Insect  Populations 

D  Sampling  Insect  Populations 

A  Repeated  Measures  ANOVA 

B  Statistical  Graphics 

C  Discriminant  Analysis 

D  Discriminant  Analysis 

BREAK 

A        Sampling  Insect  Populations 
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Appendix  A-continued 


TIME 

LOCATION 
0900-0950 
AUD 
0950-1015 

1015-1100 
AUD 

1100^1130 
AUD 

1130-1200 
AUD 

1200-1300 

TIME 

LOCATION 

1300-1400 
CR 

CR 

AUD 

AUD 

1400-1500 
AUD 

AUD 

CR 

CR 

1500-1515 
Auditorium 


Program 
Friday,  October  22,  1990 

TOPIC 

Experimental  Design 
Zack  Florence 
BREAK 

Parametric  Assumptions 
Robert  "Bob"  Hardin 

Statistical  Problems  in  Compliance  Assessment 
Albert  Liem 

Environmental  Chemical  Analysis 
Ian  Johnson  and  Yogesh  Kumar 

LUNCH-ARC  (Catered) 

GROUP  TOPIC 

A  Response  Surface  Analysis 

B  Response  Surface  Analysis 

C  SAS  Applications 

D  SAS  Applications 

A  SAS  Applications 

B  SAS  Applications 

C  Response  Surface  Analysis 

D  Response  Surface  Analysis 

WRAP-UP 


LEGEND 

AUD=  Auditorium 

LMR=  Library  Meeting  Room,  Main  Floor 

BDR=  Board  Room 

CR=     Conference  Room 
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TIME 

LOCATION 


BDR 
LMR 
CR 

1615-1715 
CR 

BDR 

BDR 

LMR 


Appendix  A-continued 
GROUP  TOPIC 


Sampling  Insect  Populations 
Statistical  Graphics 
Repeated  Measures  ANOVA 

Discriminant  Analysis 
Repeated  Measures  ANOVA 
Repeated  Measures  ANOVA 
Statistical  Graphics 


1900  (7:00  PM) 
BUFFET  (Optional) 

Terrace  Inn,  4440-Calgary  Trail  Northbound 

LEGEND 

AUD=  Auditorium 

LMR=  Library  Meeting  Room,  Main  Floor 
BDR=  Board  Room 
CR=     Conference  Room 
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